25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...

Online calculator to count occurrences of search engine query strings in website access logs

Only report on page (file) requests that match the following regular expression (see Note 1).
To report on all files, the box must be blank or contain these two characters: .*
Report visitor queries that occurred at least this many times. Default=1
 

Instructions

Paste lines from your website's HTTP access log into the box above. Click Analyze.

The lines you copy and paste should look like the following example from a typical log in Combined Log Format (the Common Log Format, CLF, plus referer and user-agent fields). Be sure to copy from the correct log type (an HTTP access log, not an FTP log). This example is a referral from Google; the query is the value of the q= parameter in the referer URL (server+hacking+protection). Those are the search terms that brought the user to the page, which is what this calculator extracts and reports:

111.222.333.444 - - [12/Jan/2011:13:27:05 -0700] "GET /blog/20070705.htm HTTP/1.1" 200 79205 "http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=server+hacking+protection" "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0"
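
As a rough illustration of the extraction step, here is a minimal JavaScript sketch that pulls the requested page and the decoded query out of one log line like the example above. It is not the calculator's actual code. The names LOG_LINE and extractQuery are hypothetical, and it assumes the query is carried in the q parameter of the referer URL, as in the Google example; other search engines may use other parameter names.

```javascript
// Sketch only: field layout follows the Combined Log Format example above.
// Groups: 1=host 2=ident 3=user 4=time 5=method 6=path 7=status 8=bytes
//         9=referer 10=user agent
const LOG_LINE =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"/;

// Hypothetical helper: returns { page, query }, or null when the line does
// not parse or its referer carries no recognizable search query.
function extractQuery(line) {
  const m = LOG_LINE.exec(line);
  if (!m) return null;

  const page = m[6];
  const referer = m[9];

  let url;
  try {
    url = new URL(referer);
  } catch (e) {
    return null; // referer is "-" or not an absolute URL
  }

  // Assumption: the search terms are in the "q" parameter.
  // searchParams.get() decodes "+" and %XX escapes for us.
  const query = url.searchParams.get('q');
  if (!query) return null;

  return { page, query };
}
```

For the example line above, extractQuery returns the page /blog/20070705.htm and the query "server hacking protection".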

For where to find your HTTP logs and how to download and unzip them, see here.

The only limit to the number of log lines you can paste is whatever JavaScript and your browser can handle (see Note 4). The more lines you can paste (and the longer the time period it covers), the more comprehensive the report will be.

The report is output into the same box where you pasted the input data. After a few status lines at the top, it is in TAB-separated multi-column format that should paste easily into any spreadsheet program. For easier viewing of the text, you can copy it into a text editor, or, in Firefox 4, use the textarea's drag handle to expand it to fit the text.

Description

Popular web log analyzer programs provide reports of the top queries that people used for finding "your site" at search engines. Sometimes it is more useful to know ALL the queries that people used, and not just for "your site" in general, but for each individual page. That's what this online calculator is for. For every page in your site, it tells you ALL the search queries that brought people to it.

That can help with search engine optimization. It tells you whether your page is ranking for broad queries (one or two search words) or mostly for long-tail queries (multiple search terms, a narrowly focused query). When your page ranks well for broad queries, it generally brings more traffic than if it only ranks for longer, more specific, queries.

The calculator extracts the lines where the referer field contains a search engine query used by a visitor to find the page, decodes the search query to make it readable, and creates a multi-column TAB-separated sorted output:

  1. Filename (the page request). This is first in the sort order, ascending alphanumerically.
  2. Query frequency. Second in the sort order, descending numerically (highest count at top).
  3. Word count of the query phrase. Not used for sorting.
  4. Character count of the query phrase. Not used for sorting.
  5. The query text. Within each group of queries with the same frequency count, the queries in this column are in ascending alphanumeric order.

The word and character counts can be used for creating alternative sort orders in a spreadsheet.
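
The counting and sorting described above can be sketched in JavaScript as follows. This is an illustration under stated assumptions, not the calculator's own code: tabulate is a hypothetical name, the input is assumed to be an array of { page, query } objects with queries already decoded and normalized, and "alphanumeric" order is taken to mean plain string comparison.

```javascript
// Hypothetical helper: count each (page, query) pair and return the report
// rows as TAB-separated strings in the sort order described above.
function tabulate(hits, minCount = 1) {
  const byPage = new Map(); // page -> Map(query -> count)
  for (const { page, query } of hits) {
    if (!byPage.has(page)) byPage.set(page, new Map());
    const queries = byPage.get(page);
    queries.set(query, (queries.get(query) || 0) + 1);
  }

  const rows = [];
  for (const [page, queries] of byPage) {
    for (const [query, count] of queries) {
      if (count < minCount) continue; // the "at least this many times" setting
      rows.push({
        page,
        count,
        words: query.split(/\s+/).filter(Boolean).length,
        chars: query.length,
        query,
      });
    }
  }

  // Plain string comparison (assumed meaning of "alphanumeric" order).
  const cmp = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

  // 1. page ascending, 2. frequency descending, 3. query text ascending.
  rows.sort((a, b) =>
    cmp(a.page, b.page) || b.count - a.count || cmp(a.query, b.query));

  return rows.map(r => [r.page, r.count, r.words, r.chars, r.query].join('\t'));
}
```

The word and character counts are computed but take no part in the sort, which matches the column descriptions above.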

Notes

  1. You can enter a JavaScript-style regular expression (mostly the same as a Perl-compatible regular expression, PCRE) in this box to limit the filenames for which the report is generated. For example, for a report only about pages ending in .htm or .html, enter \.html?$ in the box. The report will include only pages matching the regular expression.
     
  2. The Case-insensitive box determines how the filter is applied to page (file) names. The box should normally be checked. As an example of how it works, let's say your files have inconsistent names, some with .htm extensions but others with .HTM, capitalized. Using the example regex above:

    \.html?$ with Case-insensitive checked reports on all the .htm and .HTM files.
    \.html?$ with Case-insensitive cleared reports on only the .htm files, and omits .HTM files.
    \.HTML?$ with Case-insensitive cleared reports on only the .HTM files, and omits .htm files.
     
  3. The search engine queries are converted to lower case before being processed, so they are always case-insensitive. Also, multiple consecutive spaces are compressed to a single space. With these rules, the queries "frontpage .htaccess", "FrontPage    .htaccess", and "Frontpage      .htaccess" count as 3 occurrences of the same query.
     
  4. Using Firefox 4 on my computer (WinXP3, 3.2GHz P4, 1GB RAM), the calculator can handle about 32,000 log lines. The limiting part of the operation is pasting the text into the input box.

    Internet Explorer 8 is too slow to be used for this calculator. The most lines it can receive into the input box without spending several minutes in an unresponsive state is about 2,000.
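
Notes 1 to 3 can be illustrated with a short JavaScript sketch. The function names makePageFilter and normalizeQuery are hypothetical and this is not the calculator's actual code; it only shows one way the regex box, the Case-insensitive checkbox, and the query clean-up rules could behave as described.

```javascript
// Hypothetical helper: build the page-name filter from the regex box and the
// Case-insensitive checkbox. A blank box is treated like ".*" (all files).
function makePageFilter(pattern, caseInsensitive) {
  const source = pattern === '' ? '.*' : pattern;
  const re = new RegExp(source, caseInsensitive ? 'i' : '');
  return page => re.test(page);
}

// Hypothetical helper: lower-case the query and compress runs of spaces
// to a single space, as described in Note 3.
function normalizeQuery(query) {
  return query.toLowerCase().replace(/ {2,}/g, ' ');
}

// Usage, following the examples in Notes 2 and 3:
const anyCase = makePageFilter('\\.html?$', true);
const lowerOnly = makePageFilter('\\.html?$', false);
console.log(anyCase('/blog/PAGE.HTM'));   // true
console.log(lowerOnly('/blog/PAGE.HTM')); // false
console.log(normalizeQuery('FrontPage    .htaccess')); // frontpage .htaccess
```

Note the doubled backslash in the JavaScript string literal; in the calculator's input box you type the regex with a single backslash.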
     

Bug reports, feature suggestions, comments, and questions can be submitted in the discussion forum.


Search Engine Queries tabulator - Perl script

HTTP logs tend to be very large, and it is only a small percentage of the lines that contain the information this calculator needs. The limitation of doing the task in a browser is that so many lines must be pasted into the box even though most of them are discarded.

That's no problem for this Perl script that you run from the command line. It reads one or more log files and produces the same output as the JavaScript calculator. On my PC, it processes a 130MB log file in about 40 seconds.

In the first column of this example report, the pages are sorted ascending alphabetically (only one page shown below). The most frequent search queries for each page are listed first, and within each count the queries are sorted alphabetically:

PAGEREQUEST	SEARCHCOUNT	WORDS	CHARS	SEARCHSTRING
/blog/20061231.htm	3	2	19	frontpage .htaccess
/blog/20061231.htm	2	4	30	correct htaccess for frontpage
/blog/20061231.htm	2	5	35	front page extensions .htacess file
/blog/20061231.htm	2	2	18	htaccess frontpage
/blog/20061231.htm	2	3	21	mod rewrite frontpage
/blog/20061231.htm	1	3	20	.htaccess front page
/blog/20061231.htm	1	2	19	.htaccess frontpage
/blog/20061231.htm	1	3	30	.htaccess frontpage extensions
/blog/20061231.htm	1	4	29	allow htaccess with frontpage
/blog/20061231.htm	1	6	38	block address in htaccess in frontpage
/blog/20061231.htm	1	8	51	can i still use .htaccess with frontpage ext
/blog/20061231.htm	1	4	36	cpanel htaccess frontpage extensions
/blog/20061231.htm	1	5	33	front page extensions + redirects
/blog/20061231.htm	1	3	22	frontpage and htaccess
/blog/20061231.htm	1	3	30	frontpage extensions .htaccess
/blog/20061231.htm	1	6	32	how to use frontpage with cpanel
/blog/20061231.htm	1	4	26	htaccess deny ip frontpage
/blog/20061231.htm	1	4	34	htaccess with frontpage extensions
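
If you would rather re-sort a report like this in code than in a spreadsheet, the following JavaScript sketch shows one possible alternative order: fewest words first (broad queries at the top), then highest count. The function name sortByWordCount is hypothetical. It assumes the five TAB-separated columns shown above and skips any line, such as the heading row or a status line, whose second column is not a number.

```javascript
// Hypothetical helper: parse TAB-separated report text and return the rows
// sorted by word count ascending, then by search count descending.
function sortByWordCount(reportText) {
  const rows = [];
  for (const line of reportText.split(/\r?\n/)) {
    const fields = line.split('\t');
    // Skip headings, status lines, and blanks: need 5 columns, numeric count.
    if (fields.length < 5 || !/^\d+$/.test(fields[1])) continue;
    const [page, count, words, chars, ...rest] = fields;
    rows.push({
      page,
      count: Number(count),
      words: Number(words),
      chars: Number(chars),
      query: rest.join('\t'),
    });
  }
  rows.sort((a, b) => a.words - b.words || b.count - a.count);
  return rows;
}
```

Each returned row is an object with page, count, words, chars, and query properties, so other orderings (for example by chars) only need a different comparison function.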

You must have the Perl language installed on your PC to run this script. I developed and tested this script with Perl 5.10.0 in Ubuntu Linux and with ActivePerl 5.10.1 in Windows.

Example usage:

perl -WT SearchEngineQueryStrings.pl [options] [logfile1...] [> outfile]

Options:
-r|--regex="REGEXP" Report page requests matching this regex. Default=".*"
                    Example (pages ending in .htm or .html): "\.html?$"
-c|--case-sensitive Makes the -r option case-sensitive
                    so that \.htm$ matches .htm files but not .HTM files.
-m|--mincount=INT   Report queries occurring at least this many times. Default=1
-v|--verbose        Show column headings.
-h|--help           Show this help and exit.

It uses this Perl module:

use Getopt::Long;

The program's code structure and formatting are more like typical C++ formatting than typical Perl formatting.

Download link: US$6.00 (25 KB)

Description: The zip file contains 3 versions of the script. The only differences are the line endings: Linux=LF, Windows=CRLF, Mac=CR. Rename the version you want to use to SearchEngineQueryStrings.pl.

The Buy Now button goes to the PayPal website:

  • If you pay from your PayPal account, PayPal automatically redirects you to my download page.
     
  • If you pay with a credit card, the redirect to my download page is not automatic. You will see a "Return to Merchant" button/link on the last PayPal confirmation page. Click that link to go to my download page.
     
  • If you cancel the transaction at PayPal with their "Cancel and return to [my email address]" link, you will return to this page you are reading now instead of going to the download page.

 
