25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...
Perl script to extract and list hyperlinks from files

The script below is a short program that uses the Perl module HTML::LinkExtor to parse one or more files given on the command line and list the hyperlinks they contain. It's surprising how much can be done with only a few lines of Perl code. I'm new to Perl, and don't claim this is the best or most efficient way to do the task. The code could probably be reduced to even fewer lines.
#!/usr/bin/perl -wT
# ExtractHyperlinks.pl 8-10-2010
# Extracts and prints a list of the hyperlinks from the files listed on command line.
#
# Copyright (C)2010 Steven Whitney.
# Initially published by http://25yearsofprogramming.com.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
# Example simple use:
# perl -wT ExtractHyperlinks.pl index.htm index2.htm...
#
# Multiple files example in Linux:
# I used the find command because the shell splits an unquoted filename that contains spaces
# into separate arguments, so each part would be treated as a separate file.
# find passes each filename to perl as a single argument, which avoids that problem.
# The path to the .pl must be explicit because -execdir runs perl from each file's own folder,
# but the output redirection goes to the folder you are in when you launch the command:
# find ./ -iregex '.*\.htm' -execdir perl -wT /home/user/path/ExtractHyperlinks.pl '{}' >> AllHyperlinks.txt \;
#
# As an example of some post-processing that can be done on the output file,
# the following removes the HTML tag and attribute name that normally precede the hyperlink in the output;
# then it selects only the lines that refer to files with an ".htm" extension;
# then it removes all occurrences of "../" so that "filename", "../filename", and "../../filename" won't be
# treated as 3 separate files when they're really the same file;
# then it sorts the list and reduces it to a list of unique filenames:
# ssed -Re '{s/[^ ]+ [^ ]+ (.*)/\1/}' AllHyperlinks.txt | grep -iP '\.htm' | ssed -Re '{s/\.\.\///gi}' | sort -u > AllHyperlinksSortedUnique.txt
#
# Multiple files example in Windows:
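# For multiple files in Windows, an alternative to "find" must be used, such as the following.
# The quotes keep a filename that contains spaces together; inside a batch file, write %%f instead of %f.
# for %f in (*.htm) do perl -wT ExtractHyperlinks.pl "%f"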
# There are Windows versions of both ssed and grep, but Windows sort is not very flexible.
# Manually sorting the file in the Notepad++ editor works well.
use strict;
use warnings;
use HTML::LinkExtor; # load the needed module
my $p = HTML::LinkExtor->new(); # create a link extractor object
foreach my $s (@ARGV)               # iterate over the command line arguments (filenames)
{
    # LinkExtor builds and stores the list of links internally.
    # parse_file returns undef if the file cannot be opened; $! holds the reason.
    $p->parse_file($s) or die("\nOutput incomplete. Cannot open input file $s: $!\n");
    foreach ($p->links)             # now iterate over the hyperlink list and print each
    {
        print "@$_\n";              # each list element is an array reference that must be dereferenced
    }
}
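
The post-processing described in the script's comments can also be sketched in Perl itself, which avoids needing ssed, grep, and a flexible sort on Windows. This is only a minimal sketch and not part of the original script: the subroutine name unique_htm_links and the sample lines are made up for illustration, and it assumes each input line has the form "tag attribute url" as printed by the script. Perl's default sort compares strings by character code, so the order may differ from what sort -u produces under some locales.

```perl
#!/usr/bin/perl
# Hypothetical helper (an illustration, not the original author's code):
# reduce "tag attribute url" lines to a sorted list of unique .htm targets.
use strict;
use warnings;

sub unique_htm_links {
    my @lines = @_;                 # work on a copy of the input lines
    my %seen;                       # hash keys give us uniqueness
    for my $line (@lines) {
        chomp $line;
        my $url = $line;
        # Drop the first two space-separated fields (tag and attribute name).
        # Lines that do not match are left unchanged, as the sed command would.
        $url =~ s/^[^ ]+ [^ ]+ (.*)$/$1/;
        # Keep only lines that mention ".htm" (this also matches ".html").
        next unless $url =~ /\.htm/i;
        # Remove every "../" so relative variants collapse to one name.
        $url =~ s{\.\./}{}g;
        $seen{$url} = 1;
    }
    return sort keys %seen;
}

# Made-up sample lines in the format the extractor prints.
my @sample = (
    "a href index.htm",
    "a href ../index.htm",
    "img src images/logo.png",
    "a href ../../about.htm",
    "a href about.htm",
);
print "$_\n" for unique_htm_links(@sample);
```

To process a real output file, the sample list could be replaced with lines read from it, for example unique_htm_links(<>) with AllHyperlinks.txt given on the command line.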
Copyright ©2010 Steven Whitney. Last modified Thu 10/21/2010 02:08:03 -0700.