|
|
WebStats.mdb - CLF web site log statistics
WebStats.mdb is a Microsoft Access 2003 database for analyzing raw web server
access logs that are in Combined Log Format (CLF). It is not designed to be an
all-in-one solution to replace
Webalizer or other statistics programs, but it does:
- Generate a few specific reports and statistics that appear to be
difficult or impossible to generate with Webalizer.
- Provide the flexibility of having the data stored in a general-purpose
database. You can easily create new custom filters and reports as you need
them.
This is the database I use for web statistics reports. It is a work in
progress with new features added as I need them.
Click here to
download
webstats.zip (59 KB). It contains:
- WebStats.mdb -- the database in MS Access 2003 format.
- WebStats.doc -- this page of instructions in MS Word 2003
format.
- Copying.txt -- the GNU GPL license.
Changes in Version 2.0 (3/12/2007):
- CDate is used to interpret the CLF date/time string. CVDate used in
Version 1.0 didn't properly handle date calculations when the log contained
entries spanning multiple years.
- Two fundamental queries now request start and end dates so you can
obtain a report for a specific time period. All queries based on those base
queries now also request start and end dates.
- There are several new queries and 1 new form.
- You should be able to import your Log table from a Version 1 database
into Version 2 with no difficulties.
The questions the database is supposed to answer are:
- How many real, human, people are looking at my site?
- What real, HTML, pages are they viewing?
Real People
A server log file shows all the requests for files that were made to the
server. It includes all the requests by your site visitors, intermingled with
all the requests by automated crawlers and by you, the site owner. If your site
is new or has a low amount of traffic, the large number of accesses by crawlers
and bots and by yourself during building and testing may make it difficult to
answer the question, "How many real, human, people are actually looking at my
site?" So the first purpose of this database was to allow filtering out web crawlers and yourself so only real human
visitors are reported.
Real Pages
The server also logs every request for every file. If an HTML page contains
references to 50 small .gif files used on its navigation bars, then every
request for that page will generate 51 log file entries. The second purpose of
this database was to allow defining what is a Real Page, and filter out the
rest.
Both of these things can be done to some degree with Webalizer when you run
it locally on your own computer and customize its .conf file. It generates more useful reports
locally than it does when run on a remote server where customization of the
.conf file is not allowed.
However, even run locally, some of the reports I wanted to see would:
- require modifying Webalizer filters to get the
desired result for one report but with unwanted side effects for other
reports generated during the same run.
- require creating
multiple Webalizer .conf files for different purposes, with the difficulties of maintaining multiple .conf files when modifying
settings common to all of them.
So, to get more flexibility, I created this database.
Instructions for use:
- Download your raw access log from your server and unzip it,
if necessary.
- Rename the log file so its extension is .txt (.csv, .tab,
or .asc would also be ok)
- In WebStats.mdb, click File > Get External Data > Import.
- Select Files Of Type: Text Files, navigate to the file,
select it, and click Import.
- At the Security Warning about the Microsoft Office
ACWZLIB.MDE wizard (which is a built-in module that Access should probably know
enough to trust), click Open.
- In the Import Text Wizard dialog, click Delimited. Next...
- Select Delimiter = space, and Text Qualifier = ". First
row does not contain field names. Next...
- Store your data in an existing table: Log.
- At the next page, click Finish. You can display Help if you
want to, but you don't want the table analyzed.
There will frequently be import errors, saved to a separate table, which you
will discover are truncation of Referrer fields that were longer than 255
characters. This can be resolved by making Referrer a Memo field, but that
would make it a non-indexable field, so I didn't do it.
Customization is required before use
Examine the following queries in Design View before using them:
- The RealPagesByRealPeople query requires customization:
| Host |
You might be able to set criteria for this field to
exclude your own visits to your site. If any
part of your own IP address never changes, enter it here in the form of:
Not Like "xxx.yyy.*". This will also filter out anyone else whose IP
matches that pattern. If you're with a large ISP, that's not desirable.
The current text in the query, Not Like "000.000.*", is a dummy
placeholder to show where the line goes.
The criteria here also specify some robots that I
routinely filter out. |
| Agent |
This field is where you can filter out bots,
crawlers, and anyone else you consider not a real visitor. |
| FileName |
This is where you filter out file extensions you do
not consider to be Real Pages. There are no filters provided here
because after importing data I now permanently delete all records that
do not refer to Real Pages. The action query for doing that is described
below. |
- The Server Error Report query also has a couple of criteria where you
might want to make modifications regarding which files you want to exclude
and which server result codes you consider errors or not.
- You should look at all the other queries, too. Since this is the
database I actually use, it is likely I've set a number of criteria that
won't suit your purposes.
Database contents:
Tables:
|
Log |
Table designed to allow direct import of CLF log files without
any preprocessing. I have discovered to my surprise that
not all CLF log files have the same format. This database was designed around
the Apache log files that I receive from my web host. However, the logs
generated by Apache 2 on my local computer lack the last two fields provided for
in this database. It is possible you might need to adjust the Log table to match
the fields provided in your own log files. | Log Table Fields:
|
Host |
IP address of the computer that made this request to the server |
|
RFCId
|
RFC 1413 identity of the remote host (usually a dash, meaning not known) |
|
UserID
|
The REMOTE_USER, the HTTP authenticated login userid of the requester. Usually a dash, or your own userid. |
|
Time
|
Date and time (server time) |
|
Zone
|
Time zone (where the server is located) |
|
File
|
The HTTP request line, containing the filename requested |
|
Result
|
The server status result code |
|
Bytes
|
Number of bytes actually sent by the server |
|
Referrer
|
The web page containing the link that was followed to this page of your site |
|
Agent
|
User-Agent info about the remote host browser and operating system
|
Queries
The most useful ones are highlighted in yellow.
The queries that have {} around their
descriptions are data sources for other queries, and are otherwise not
interesting.
|
AllPagesAccessedByOneHostIPA |
Find all the accesses attempted from a given IP address. |
|
Browser Tabulation |
Crude tabulation of browsers & versions. Set the ones you want to
recognize in Module1. |
|
Daily Stats - Real People Real Pages (Chart Source) |
Totals for each day: Unique human visitors and Real pages viewed. |
|
Delete Trivial Files GIF JPG PNG CSS |
PERMANENTLY DELETE tiny files that are not real pages, to reduce
database size. Set your criteria carefully. Preview before deleting records. |
|
Hack Attempts |
Possible attempts to hack the server. (If you discover a
successful hack, see
Step By Step Repair After a Website Hack and How To Prevent It.) |
|
Last Access Date For Each Page |
The most recent access date for each of your site pages. |
|
PagesViewedPerHost |
{Number of pages viewed by each unique visitor (IP).} |
|
PagesViewedPerHost - Average, StdDev |
Average number of pages viewed by each unique visitor (IP), and
standard deviation. |
|
PagesViewedPerHost-Tabulation |
How many visitors viewed 1 page, how many viewed 2, ... |
|
PopularPages-RealPeopleOnly |
The most popular pages on your site, with the count of PageViews
of each. |
|
PopularPages-RealPeopleOnly-BySearchString |
The most popular pages whose path+filenames match a substring you
enter. |
|
Raw Log in Chronological Order |
Raw log, no filtering, in chronological order. |
|
Real Visitors and Real Page Views - Tabulation By Month |
Total Unique Visitors and Total PageViews for each MONTH. |
|
Real Visitors and Real Page Views - Tabulation By Month, Day |
Total Unique Visitors and Total PageViews for each DATE. |
|
RealPagesByRealPeople |
Chronological Log of Real Pages by Real People. A source for
several other queries. It is like cpanel's Latest Visitors report, but filtered. |
|
RealPagesByRealPeopleSortedByFile |
For each file on your site, shows all accesses of it, in
chronological order. |
|
RealPagesByRealPeopleSortedByHost |
For each unique Visitor (IP), shows a chronological log of their
visits, and what they viewed. |
|
Redirects |
Page requests resulting in code 301 or 302 redirects |
|
Server Error Report |
Server access errors to investigate. |
|
Unique Agent Strings |
{Tabulates the incidence count of each unique Agent string and
extracts the Browser used.} |
|
Unique Visitors - By Month |
{For each MONTH, shows all the Unique Visitors (IPs) and how many
pages each viewed.} |
|
Unique Visitors - By Month, Day |
{For each DATE, shows all the Unique Visitors (IPs) and how many
pages each viewed.} |
Form
|
Daily Stats - Real People Real Pages |
2-series line chart
tracking daily a) Real Visitors and b) PageViews of real pages |
|
PagesViewedPerHost - Tabulation |
Bar chart showing how many visitors viewed 1 page, how many
viewed 2,... |
|