25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...
Home   Projects   Up   Sitemap   Search   Blog   Forum+Chat   About Us   Privacy   Terms of Use   Feedback   FAQ   Images   Services   Payments   Humor   Music

MySQL database for linkchecker broken hyperlink results

The linkchecker program checks websites or local groups of files for broken hyperlinks. Its command line version has several output options. One of them is for an SQL script with INSERT statements to insert the rows containing the crawl results into an SQL database. The linkchecker source code package contains (except in at least one version of the Ubuntu .deb, from which it seems to have been inadvertently omitted) a sample generic (or PostgreSQL) SQL script for creating just the needed database table.

The script below is specifically for MySQL and is more complete, creating both the database and its table, with an added auto-increment `id` primary key field. I also made some of the text fields large to store longer paths.

If the author/maintainer of linkchecker runs across this and would like to incorporate this MySQL version of the script into the linkchecker package, they can feel free to do so.

Instructions

  1. Run linkchecker to get its linkchecker-out.sql output file.
  2. Log into the MySQL Command Line Client:
    mysql -uroot -p
  3. Run the CreateLinkChecker.sql script below to create the database and table. Provide the full path to the script. Even in Windows, use forward slash path separators rather than the usual backslashes:
    source full/path/to/file/CreateLinkChecker.sql;
  4. Run the linkchecker output script to import the data into the database:
    source full/path/to/file/linkchecker-out.sql;
  5. Then you can work with the database in MySQL.

Example commands possibly helpful when working with linkchecker

Use the FTP feature of wget to download all files matching a pattern (such as certain text files only, in the example):

wget --ftp-user=USERID --ftp-password='PASSWORD' -oWget.log -r --level=10 -w0 --accept cfm,css,htm,html "ftp://EXAMPLE.com/public_html/"

Use linkchecker to check for broken links in all files in the current directory (on the local PC) and below, but ignoring certain URL types, writing the output separately to a .csv file (linkchecker-out.csv) and also to an .sql file (linkchecker-out.sql):

linkchecker --ignore-url=^mailto: --ignore-url=^irc:// --ignore-url=^http://www\.EXAMPLE\.com --ignore-url=^http://EXAMPLE\.com --ignore-url=^ftp:// --ignore-url=^https:// -Fsql -Fcsv --no-warnings *

In MySQL, view or delete rows that you aren't interested in but forgot to exclude in wget. These are just a couple of different examples, showing use of REGEXP and NOT REGEXP, and a variation where $ is used to mark the end of the string. It means "the line ends with .htm":

-- View the files that don't end with .htm
-- However, note that filename.htm#bookmark is an .htm link, but it doesn't end with .htm.
SELECT urlname
FROM linksdb
WHERE `urlname` NOT REGEXP '\\.(htm)$';

-- Delete rows where the filename contains . followed by any of the extensions listed.
DELETE
FROM linksdb
WHERE `urlname` REGEXP '\\.(cfg|dhtml|inf|ini|pl|rdf)';

In MySQL, the following creates an output format that is good for browsing the results. In the MySQL Command Line Client, you can use tee/notee to echo the output to a text file. The text file can be imported to Excel or OpenOffice.org Calc if you tell the import wizard that the separator character is a "|" (vertical line). An unnecessary leftmost column will be created, which you can delete:

SELECT `result`, `parentname`, `line`, `col`, `urlname`
FROM linksdb
ORDER BY `result`, `parentname`, `line`, `col`;

CreateLinkChecker.sql

/*
CreateLinkChecker.sql	8-18-2010
MySQL version. Creates database for the linkchecker program output data.
Copyright (C)2010 Steven Whitney
Initially published by http://25yearsofprogramming.com

This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either version 2
of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.

*/
WARNINGS;
SET FOREIGN_KEY_CHECKS = 0;

DROP DATABASE IF EXISTS `linkchecker`;
-- BINARY COLLATION MAKES THE HYPERLINKS CASE-SENSITIVE BY DEFAULT WHEN SORTING OR FILTERING.
CREATE DATABASE IF NOT EXISTS `linkchecker`
	CHARACTER SET latin1 COLLATE latin1_bin;
USE `linkchecker`;

-- linksdb is the table name the linkchecker program uses in its .sql file output
DROP TABLE IF EXISTS `linkchecker`.`linksdb`; 
CREATE TABLE `linkchecker`.`linksdb` 
(
	`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`urlname`        VARCHAR(1024) NOT NULL,
	`recursionlevel` INT NOT NULL,
	`parentname`     VARCHAR(1024),
	`baseref`        VARCHAR(1024),
	`valid`          INT,
	`result`         VARCHAR(1024),
	`warning`        VARCHAR(1024),
	`info`           VARCHAR(1024),
	`url`            VARCHAR(1024),
	`line`           INT,
	`col`            INT,
	`name`           VARCHAR(1024),
	`checktime`      INT,
	`dltime`         INT,
	`dlsize`         INT,
	`cached`         INT,
	PRIMARY KEY (`id`)
)
ENGINE = INNODB;

SET FOREIGN_KEY_CHECKS = 1;

DESCRIBE `linksdb`;

 


 

Valid HTML 4.01 Transitional Valid CSS
Yahoo! Search
Search the web Search this site
View content labeling at ICRA.