WebStats.mdb - Microsoft Access 2003 database to import CLF website log files and report statistics
WebStats.mdb is a Microsoft Access 2003 database for analyzing raw web server access logs that are in Combined Log Format
(CLF). It is not designed to be an all-in-one solution to replace
Webalizer or other statistics programs, but it does:
- Generate a few specific reports and statistics that appear to be difficult or impossible to generate with Webalizer.
- Provide the flexibility of having the data stored in a general-purpose database. You can easily create new custom
filters and reports as you need them.
This is the database I use for web statistics reports. I republish it once in a while with whatever new features I've added.
Most of the discussion and instructions below are to help you if you want to see the same type of information that I look at,
by using the same pre-designed queries that I use. However, the database's main feature is that it can import CLF log files. Once
you've done that, you can, for example, delete all my queries and design whatever new ones you want.
It's not a "turnkey" system. Familiarity with Access and knowledge how to use it is needed.
Click here to
download
webstats.zip (59 KB). It contains:
- WebStats.mdb -- the database in Access 2003 format.
- WebStats.doc -- this page of instructions in Word 2003 format.
- Copying.txt -- the GNU GPL license.
Changes in Version 2.0 (3/12/2007):
- CDate is used to interpret the CLF date/time string. CVDate used in Version 1.0 didn't properly handle date calculations
when the log contained entries spanning multiple years.
- Two fundamental queries now request start and end dates so you can obtain a report for a specific time period. All
queries based on those base queries now also request start and end dates.
- There are several new queries and 1 new form.
- You should be able to import your Log table from a Version 1 database into Version 2 with no difficulties.
The questions the database is supposed to answer are:
- How many real, human, people are looking at my site?
- What real, HTML, pages are they viewing?
Real People
A server log file shows all the requests for files that were made to the server. It includes all the requests by your site
visitors, intermingled with all the requests by automated crawlers and by you, the site owner. If your site is new or has
a low amount of traffic, the large number of accesses by crawlers and bots and by yourself during building and testing may
make it difficult to answer the question, "How many real, human, people are actually looking at my site?" So the first purpose
of this database was to allow filtering out web crawlers and yourself so only real human visitors are reported.
Real Pages
The server also logs every request for every file. If an HTML page contains references to 50 small .gif files used on its
navigation bars, then every request for that page will generate 51 log file entries. The second purpose of this database was
to allow defining what is a Real Page, and filter out the rest.
Both of these things can be done to some degree with Webalizer when you run it locally on your own computer and customize
its .conf file. It generates more useful reports locally than it does when run on a remote server where customization of the
.conf file is not allowed.
However, even run locally, some of the reports I wanted to see would:
- require modifying Webalizer filters to get the desired result for one report but with unwanted side effects for other
reports generated during the same run.
- require creating multiple Webalizer .conf files for different purposes, with the difficulties of maintaining multiple
.conf files when modifying settings common to all of them.
So, to get more flexibility, I created this database.
Instructions for use:
- Download your raw access log from your server and unzip it, if necessary.
- Rename the log file so its extension is .txt (.csv, .tab, or .asc would also be ok)
- In WebStats.mdb, click File > Get External Data > Import.
- Select Files Of Type: Text Files, navigate to the file, select it, and click Import.
- At the Security Warning about the Microsoft Office ACWZLIB.MDE wizard (which is a built-in module that Access should
probably know enough to trust), click Open.
- In the Import Text Wizard dialog, click Delimited. Next...
- Select Delimiter = space, and Text Qualifier = ". First row does not contain field names.
Next...
- Store your data in an existing table: Log.
- At the next page, click Finish. You can display Help if you want to, but you don't want the table analyzed.
There will frequently be import errors, saved to a separate table, which you will discover are truncation of Referrer fields
that were longer than 255 characters. This can be resolved by making Referrer a Memo field, but that would make it a non-indexable
field, so I didn't do it.
Customization is required before use
Examine the following queries in Design View before using them:
- The RealPagesByRealPeople query requires customization:
| Host |
You might be able to set criteria for this field to exclude your own visits to your site.
If any part of your own IP address never changes, enter it here in the form of: Not Like "xxx.yyy.*".
This will also filter out anyone else whose IP matches that pattern. If you're with a large ISP, that's not desirable.
The current text in the query, Not Like "000.000.*", is a dummy placeholder to show where the line goes.
The criteria here also specify some robots that I routinely filter out.
|
| Agent |
This field is where you can filter out bots, crawlers, and anyone else you consider not a real
visitor. |
| FileName |
This is where you filter out file extensions you do not consider to be Real Pages. There are no
filters provided here because after importing data I now permanently delete all records that do not refer to Real
Pages. The action query for doing that is described below. |
- The Server Error Report query also has a couple of criteria where you might want to make modifications regarding which
files you want to exclude and which server result codes you consider errors or not.
- You should look at all the other queries, too. Since this is the database I actually use, it is likely I've set a
number of criteria that won't suit your purposes.
Database contents:
Tables:
| Log |
Table designed to allow direct import of CLF log files without any preprocessing.
I have discovered to my surprise that not all CLF log files have the same format. This database was
designed around the Apache log files that I receive from my web host. However, the logs generated by Apache 2 on my
local computer lack the last two fields provided for in this database. It is possible you might need to adjust the
Log table to match the fields provided in your own log files.
|
Log Table Fields:
| Host |
IP address of the computer that made this request to the server |
| RFCId |
RFC 1413 identity of the remote host (usually a dash, meaning not known) |
| UserID |
The REMOTE_USER, the HTTP authenticated login userid of the requester. Usually a dash, or your
own userid. |
| Time |
Date and time (server time) |
| Zone |
Time zone (where the server is located) |
| File |
The HTTP request line, containing the filename requested |
| Result |
The server status result code |
| Bytes |
Number of bytes actually sent by the server |
| Referrer |
The web page containing the link that was followed to this page of your site |
| Agent |
User-Agent info about the remote host browser and operating system |
Queries
The most useful ones are highlighted in yellow.
The queries that have {} around their descriptions are data sources for other queries, and are otherwise not interesting.
| AllPagesAccessedByOneHostIPA |
Find all the accesses attempted from a given IP address. |
| Browser Tabulation |
Crude tabulation of browsers & versions. Set the ones you want to recognize in Module1. |
| Daily Stats - Real People Real Pages (Chart Source) |
Totals for each day: Unique human visitors and Real pages
viewed. |
| Delete Trivial Files GIF JPG PNG CSS |
PERMANENTLY DELETE tiny files that are not real pages, to reduce database size. Set your criteria
carefully. Preview before deleting records. |
| Hack Attempts |
Possible attempts to hack the server. (If you discover a successful hack, see
Step
By Step Repair After a Website Hack and How To Prevent It.) |
| Last Access Date For Each Page |
The most recent access date for each of your site pages. |
| PagesViewedPerHost |
{Number of pages viewed by each unique visitor (IP).} |
| PagesViewedPerHost - Average, StdDev |
Average number of pages viewed by each unique visitor (IP),
and standard deviation. |
| PagesViewedPerHost-Tabulation |
How many visitors viewed 1 page, how many viewed 2, ... |
| PopularPages-RealPeopleOnly |
The most popular pages on your site, with the count of
PageViews of each. |
| PopularPages-RealPeopleOnly-BySearchString |
The most popular pages whose path+filenames match a substring you enter. |
| Raw Log in Chronological Order |
Raw log, no filtering, in chronological order. |
| Real Visitors and Real Page Views - Tabulation By Month |
Total Unique Visitors and Total PageViews for each MONTH. |
| Real Visitors and Real Page Views - Tabulation By Month, Day |
Total Unique Visitors and Total PageViews for each DATE. |
| RealPagesByRealPeople |
Chronological Log of Real Pages by Real People. A source
for several other queries. It is like cpanel's Latest Visitors report, but filtered. |
| RealPagesByRealPeopleSortedByFile |
For each file on your site, shows all accesses of it, in chronological order. |
| RealPagesByRealPeopleSortedByHost |
For each unique Visitor (IP), shows a chronological log of their visits, and what they viewed. |
| Redirects |
Page requests resulting in code 301 or 302 redirects |
| Server Error Report |
Server access errors to investigate. |
| Unique Agent Strings |
{Tabulates the incidence count of each unique Agent string and extracts the Browser used.} |
| Unique Visitors - By Month |
{For each MONTH, shows all the Unique Visitors (IPs) and how many pages each viewed.} |
| Unique Visitors - By Month, Day |
{For each DATE, shows all the Unique Visitors (IPs) and how many pages each viewed.}
|
Form
| Daily Stats - Real People Real Pages |
2-series line chart tracking daily a) Real Visitors and
b) PageViews of real pages |
| PagesViewedPerHost - Tabulation |
Bar chart showing how many visitors viewed 1 page, how many viewed 2,... |
|