|
25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more... |
Home Projects Up Sitemap Search Blog Forum+Chat About Us Privacy Terms of Use Feedback FAQ Images Services Payments Humor Music |
MySQL database for linkchecker broken hyperlink resultsThe linkchecker program checks websites or local groups of files for broken hyperlinks. Its command line version has several output options. One of them is for an SQL script with INSERT statements to insert the rows containing the crawl results into an SQL database. The linkchecker source code package contains (except in at least one version of the Ubuntu .deb, from which it seems to have been inadvertently omitted) a sample generic (or PostgreSQL) SQL script for creating just the needed database table. The script below is specifically for MySQL and is more complete, creating both the database and its table, with an added auto-increment `id` primary key field. I also made some of the text fields large to store longer paths. If the author/maintainer of linkchecker runs across this and would like to incorporate this MySQL version of the script into the linkchecker package, they can feel free to do so. Instructions
|
Use the FTP feature of wget to download all files matching a pattern (such as certain text files only, in the example):
wget --ftp-user=USERID --ftp-password='PASSWORD' -oWget.log -r --level=10 -w0 --accept cfm,css,htm,html "ftp://EXAMPLE.com/public_html/"
Use linkchecker to check for broken links in all files in the current directory (on the local PC) and below, but ignoring certain URL types, writing the output separately to a .csv file (linkchecker-out.csv) and also to an .sql file (linkchecker-out.sql):
linkchecker --ignore-url=^mailto: --ignore-url=^irc:// --ignore-url=^http://www\.EXAMPLE\.com --ignore-url=^http://EXAMPLE\.com --ignore-url=^ftp:// --ignore-url=^https:// -Fsql -Fcsv --no-warnings *
In MySQL, view or delete rows that you aren't interested in but forgot to exclude in wget. These are just a couple of different examples, showing use of REGEXP and NOT REGEXP, and a variation where $ is used to mark the end of the string. It means "the line ends with .htm":
-- View the files that don't end with .htm
-- However, note that filename.htm#bookmark is an .htm link, but it doesn't end with .htm.
SELECT urlname
FROM linksdb
WHERE `urlname` NOT REGEXP '\\.(htm)$';
-- Delete rows where the filename contains . followed by any of the extensions listed.
DELETE
FROM linksdb
WHERE `urlname` REGEXP '\\.(cfg|dhtml|inf|ini|pl|rdf)';
In MySQL, the following creates an output format that is good for browsing the results. In the MySQL Command Line Client, you can use tee/notee to echo the output to a text file. The text file can be imported to Excel or OpenOffice.org Calc if you tell the import wizard that the separator character is a "|" (vertical line). An unnecessary leftmost column will be created, which you can delete:
SELECT `result`, `parentname`, `line`, `col`, `urlname`
FROM linksdb
ORDER BY `result`, `parentname`, `line`, `col`;
/* CreateLinkChecker.sql 8-18-2010 MySQL version. Creates database for the linkchecker program output data. Copyright (C)2010 Steven Whitney Initially published by http://25yearsofprogramming.com This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. */ WARNINGS; SET FOREIGN_KEY_CHECKS = 0; DROP DATABASE IF EXISTS `linkchecker`; -- BINARY COLLATION MAKES THE HYPERLINKS CASE-SENSITIVE BY DEFAULT WHEN SORTING OR FILTERING. CREATE DATABASE IF NOT EXISTS `linkchecker` CHARACTER SET latin1 COLLATE latin1_bin; USE `linkchecker`; -- linksdb is the table name the linkchecker program uses in its .sql file output DROP TABLE IF EXISTS `linkchecker`.`linksdb`; CREATE TABLE `linkchecker`.`linksdb` ( `id` INT UNSIGNED NOT NULL AUTO_INCREMENT, `urlname` VARCHAR(1024) NOT NULL, `recursionlevel` INT NOT NULL, `parentname` VARCHAR(1024), `baseref` VARCHAR(1024), `valid` INT, `result` VARCHAR(1024), `warning` VARCHAR(1024), `info` VARCHAR(1024), `url` VARCHAR(1024), `line` INT, `col` INT, `name` VARCHAR(1024), `checktime` INT, `dltime` INT, `dlsize` INT, `cached` INT, PRIMARY KEY (`id`) ) ENGINE = INNODB; SET FOREIGN_KEY_CHECKS = 1; DESCRIBE `linksdb`;
|
|
|
|
|
|
|
Copyright ©2011 Steven Whitney. Last modified Mon 09/19/2011 03:44:48 -0700. |
||