25 Years of Programming
List upcoming BeyondTV scheduled recordings with a Perl script, or with Super-sed in an MSDOS batch file

The utility program Super-sed (ssed, an enhanced version of sed) and the Perl programming language share an almost magical ability to extract and transform text using regular expressions. This article provides a walkthrough with example code showing how to perform a practical task, extracting and reporting only the desired text from an XML file, using both methods.

The example problem to solve: I use the PVR/DVR program SnapStream Beyond TV 3.5.3 to record TV shows to my PC. I use the same PC for everything I do, so I don't keep BTV running all the time. I often want to know when the next scheduled recording will be so I can launch the program before that, or before hibernating my PC so it can record the show while I'm asleep. Launching BTV just to glance at the list is a waste of time. Is there an alternative?

BTV saves the list of upcoming recordings in an XML file called Jobs.xml. It's a big data file unsuitable for browsing with a text editor [1], but it does contain the information I need. These scripts extract and display only the information I want to see.

Unlike most BTV users, I must (for other reasons) schedule recordings, whether recurring or one-time, manually rather than from the OSD Program Guide, which lets me give the programs whatever names I want. An accommodation I made for these scripts was to use a consistent naming convention for the titles, including in them the dates (or day names for recurring shows) and times. The title of Cold Case, recurring Sunday at 12:35AM, is "Sun 0035 Cold Case", and a non-recurring episode of CSI on April 16th at 8PM was called "0416 2000 CSI".

To get a schedule of upcoming recordings, I can run either of the programs listed below. They open an MSDOS console window with a listing that looks like this, enough to decide whether I can turn off the computer or should launch BeyondTV and hibernate it instead:

    0416 1500 Typeface

It's not likely that many people need to do this particular task(!), but I think the explanation of how the scripts work can help others learn how to use these extremely versatile programs. I like these scripts as examples because they demonstrate the basics generally needed for this type of task and thus should be easily modified to do other things. The basics: open a file, get its text, transform it somehow using regular expressions, output the result. That's something I need to do nearly every day.

BTVSchedule.pl - Perl script

#!/usr/bin/perl -WT
# BTVSchedule.pl 4-14-2011 Perl
# Copyright (C)2011 Steven Whitney.
# Initially published by http://25yearsofprogramming.com.
# Published under GNU GPL (General Public License) Version 3,
# with ABSOLUTELY NO WARRANTY.
#
# Lists upcoming BeyondTV scheduled recordings from Jobs.xml.
# If you launch BTV to view the list, it deletes old jobs
# (whether the recordings were made or not),
# so you can't tell if you missed any because you forgot to
# leave BeyondTV running during the time they were on.
# --------------------------------------------------------------------------------
use strict;
use warnings;
my $infile = "C:/Documents and Settings/All Users/Application Data/SnapStream/Beyond TV/Jobs.xml";
open(INFILE, "<", $infile) or die("Jobs.xml not found: ", $!);
my $AllText;
{
local $/; # FILE SLURP MODE, ONLY WITHIN THIS BLOCK
$AllText = <INFILE>;
}
close(INFILE);
$AllText =~ tr/\r//d;
$AllText =~ s#</?JobCollection>##gi;
$AllText =~ s#<Job>.*?<Property><Name>Title</Name><Value>(.*?)</Value></Property>.*?</Job>#$1\n#gi;
$AllText =~ s/\nSun /\n1-Sun /gi;
$AllText =~ s/\nMon /\n2-Mon /gi;
$AllText =~ s/\nTue /\n3-Tue /gi;
$AllText =~ s/\nWed /\n4-Wed /gi;
$AllText =~ s/\nThu /\n5-Thu /gi;
$AllText =~ s/\nFri /\n6-Fri /gi;
$AllText =~ s/\nSat /\n7-Sat /gi;
print("\n", join("\n", sort(split(/\n/, $AllText))), "\n");
print("\nPress ENTER to close...");
<STDIN>;
exit(0);
Walkthrough explanation and discussion of this simple Perl program

The first line is called the "shebang" line. The path tells the operating system where to find the program (Perl) that should be used to run this script when you invoke it just by typing its name. The shebang line is not necessary in either Windows or Linux when you invoke the script as an argument to launching Perl itself with the command perl -WT yourprogram.pl, which is how I usually do it. The switches on the shebang line (-WT, which could also be written -W -T) are significant in both Windows and Linux. -W enables all warnings, which helps you write better code. -T turns on Taint checking, which puts Perl on guard with respect to data coming from an untrusted source (such as the user).

    #!/usr/bin/perl -WT

The next two lines are also a best practice. strict generates errors if your code tries to do unsafe things, and warnings is similar to the -W switch. They help guide you toward safe coding practices.

    use strict;
    use warnings;

This just sets a string variable to the name of the file to read. my is the Perl way of declaring the variable local to the enclosing block, or local to the source file if it's not inside a {} block. In Perl, you can and should use forward slashes as the file path separator, even in Windows. If you use backslashes, you must double them: "C:\\path\\".

    my $infile = "C:/Documents and Settings/All Users/Application Data/SnapStream/Beyond TV/Jobs.xml";

Open the data file. INFILE will be a file handle if open succeeds. "<" means "for reading". ">" would mean "for writing". open returns true if it succeeds and false if it fails. or is a low-precedence equivalent of ||. In Perl, the same as in C, once the result of an OR construct is known to be true, any following tests are omitted. Thus, if open succeeds, the right-hand side of the or is not evaluated, but if open fails and returns false, the right-hand side is evaluated, and die causes the script to exit with an error message.
open(INFILE, "<", $infile) or die("Jobs.xml not found: ", $!);
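The short-circuit behavior described above can be tried without ending the program. The following is only a sketch: try_open is a hypothetical helper invented for this illustration, and the path is made up so that the open is expected to fail. It uses a lexical file handle (my $fh) rather than the bareword INFILE, which is a common alternative and not what the script itself does.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: reports the outcome instead of dying.
sub try_open {
    my ($path) = @_;
    my $result = "opened";
    # "or" binds more loosely than "=", so the assignment on the right
    # runs only when open() returns false.
    open(my $fh, "<", $path) or $result = "failed: $!";
    close($fh) if $result eq "opened";
    return $result;
}

# Made-up path that should not exist, so the right-hand side of "or" runs.
print try_open("no-such-directory/Jobs.xml"), "\n";
```

The low precedence of or matters mostly when the parentheses around open's arguments are omitted: in open my $fh, "<", $path || die ..., the || would attach to $path rather than to the result of open.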
This line declares the variable that will hold the text from Jobs.xml. String, numeric, and boolean variables (scalars) have a leading $. Arrays are declared with a leading @, and hashes with a leading %.

    my $AllText;

This block reads the text from Jobs.xml into our variable using the "file slurp" method, the whole file in a single gulp. Ordinarily, the angle-brackets <FILEHANDLE> operator reads one line at a time. $/ is a special Perl variable, the input record separator, which is a newline by default. local $/ temporarily leaves it undefined, which changes the file-read behavior to read the whole file. These two lines are enclosed in a {} block so that local $/ is only in effect within the block. I don't want that behavior anywhere else in the program. Just as in C, it is ok to use a {} block to contain and limit the scope of lines of code.

{
local $/; # FILE SLURP MODE, ONLY WITHIN THIS BLOCK
$AllText = <INFILE>;
}
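To see the difference between line-at-a-time reading and slurp mode without touching Jobs.xml, here is a small sketch that reads from an in-memory file handle. The sample text and variable names are made up for the illustration.

```perl
use strict;
use warnings;

# Made-up sample data standing in for a file on disk.
my $data = "line one\nline two\nline three\n";

# Opening a reference to a scalar gives an in-memory file handle.
open(my $fh, "<", \$data) or die("Cannot open in-memory file: ", $!);

# Default behavior: reads up to and including the first newline.
my $first = <$fh>;

my $rest;
{
    local $/;        # $/ is undefined inside this block only
    $rest = <$fh>;   # reads everything that remains in one gulp
}
close($fh);

print "first: $first";
print "rest:\n$rest";
```

Once the block ends, $/ is a newline again, so any later reads in the program go back to one line at a time.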
Close the file handle.

    close(INFILE);

Now we start processing the text. The =~ operator is Perl's binding operator: it applies the pattern match, substitution, or transliteration on its right to the variable on its left. Here, we use it in conjunction with the tr/// operator, which transliterates (converts) all instances of one character to a substitution character. The format is tr/original/replacement/. Our source file is a Windows file whose line ends are CRLF. In order to deal only with linefeeds, we delete all carriage returns (\r). The replacement list is empty, and the trailing d modifier tells tr to delete characters that have no replacement. Without the d, tr with an empty replacement list leaves the string unchanged and merely counts the matching characters. This line means "strip all the CR out of $AllText".

    $AllText =~ tr/\r//d;

The data in the XML file is contained within opening and closing <JobCollection></JobCollection> tags that are of no use, so the next line strips them out with a regular expression match, using the Perl s/// text substitution operator. The format is s/PatternToMatch/Replacement/. The usual delimiter is the forward slash, but other characters can be used. That is, s### means the same thing. Using / here would make it necessary to escape (precede) the / in the regular expression with a backslash like this \/ to avoid it being mistaken for a delimiter, so I used # as the delimiter instead. The ? is a regular expression "quantifier" meaning 0 or 1 occurrences of the previous character. Here, it means that the / can occur 0 or 1 times (that is, it's optional), so it will match both the opening and closing XML tags. The replacement text is empty (nothing between the next ##), so the tags are simply removed. After the last # are two options: g means Global (strip all occurrences, not just the first one found), and i means Ignore Case. Putting it all together, this line means "strip every instance of <JobCollection> or </JobCollection> out of $AllText".

    $AllText =~ s#</?JobCollection>##gi;

What's left in $AllText is a list of recording jobs, with many data items for each job stored between its opening and closing <Job></Job> XML tags.
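The two cleanup operators can be checked on a tiny made-up string. This sketch assumes nothing about the real Jobs.xml beyond the tag name. It also shows that tr only deletes when given the d modifier; with an empty replacement list and no d, it just counts.

```perl
use strict;
use warnings;

# Made-up miniature of the file's outer structure, with CRLF line ends.
my $text = "<JobCollection>\r\n<Job>x</Job>\r\n</JobCollection>\r\n";

# No d modifier: the string is unchanged and tr returns the number of \r found.
my $counted = ($text =~ tr/\r//);

# With the d modifier: each \r is deleted; tr returns how many were removed.
my $deleted = ($text =~ tr/\r//d);

# The optional slash (/?) lets one pattern remove both the opening and closing tag.
$text =~ s#</?JobCollection>##gi;

print "counted $counted, deleted $deleted\n";
print "remaining text: [$text]\n";
```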
All I'm interested in is the show's title, so the next line uses s#pattern#replacement# again to extract it. The line explores the complexities of regular expression pattern matching, so I'll take it one piece at a time.

Between the first two ## is the regular expression for identifying the text pattern to match. I want to deal with one recording job at a time. Each one starts with the text <Job> and ends with </Job>, so that's what I specify as the starting and ending character sequences of the regular expression.

After the pattern matching engine has identified an occurrence of <Job> as the starting point of a match, I want it to skip over all the text until it reaches the part I'm interested in. .* means a sequence of 0 to any number of characters. The question mark ? makes the expression "non-greedy", which means that it will stop matching characters at the earliest place where the rest of the regular expression can succeed, instead of the last place where it can succeed (that would be greedy!).

The next section of text is once again literal text that should be matched. The reason that the non-greedy .*? was important is that I want the matching of it to stop as soon as the engine hits the first occurrence of <Property><Name>..., and not keep running through $AllText until it finds the last occurrence, which would be its default behavior. <Property><Name>... is the opening delimiter (finally!) of the show's title, the part I'm interested in.

I match the title using a non-greedy (.*?) expression, except this time it's inside parentheses, which causes the matched text to be "captured", meaning that it is saved by Perl to a numbered variable, which I'll use later. Capturing of the show's title will continue (non-greedily) until the engine reaches the literal text </Value></Property>, which is the closing delimiter for the title's data item in the XML code. After that, the engine uses another non-greedy .*? to skip over all the rest of the text (a bunch more irrelevant XML data items) until it reaches the closing </Job> tag, and we're done with the pattern matching.

What shall we replace the matched text with? Answer: we replace it with what's between the next two ## characters. That consists of $1, which is the numbered variable into which we saved the show's title (it's $1 because it's the first, and in this case the only, parenthesized expression we used), followed by a \n newline character. In effect, we've discarded all the text that was matched except for the title that we wanted. The i flag again forces case-insensitive matching, and the g Global flag causes all the <Job>...</Job> sections to be processed rather than just the first, which would be the default without g. After the searching and replacing, $AllText contains just the list of show titles extracted from the XML file, now separated from each other by newline characters.

    $AllText =~ s#<Job>.*?<Property><Name>Title</Name><Value>(.*?)</Value></Property>.*?</Job>#$1\n#gi;

The day abbreviations Sun Mon Tue... sorted alphabetically won't produce the correct day order, so the next code block prefixes each day with a day-of-week number. Each occurrence of newline (end of the previous line) followed by "Sun " (the start of the current line) is changed to newline followed by "1-Sun ", and so on.

    $AllText =~ s/\nSun /\n1-Sun /gi;
    $AllText =~ s/\nMon /\n2-Mon /gi;
    $AllText =~ s/\nTue /\n3-Tue /gi;
    $AllText =~ s/\nWed /\n4-Wed /gi;
    $AllText =~ s/\nThu /\n5-Thu /gi;
    $AllText =~ s/\nFri /\n6-Fri /gi;
    $AllText =~ s/\nSat /\n7-Sat /gi;

The output line does a lot of work. It splits apart the single string $AllText into an array with a show title in each element, then sorts the elements alphanumerically, then joins them all back together, and prints them to the screen (STDOUT).
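The title extraction and the day-prefix substitution can be tried in isolation on a made-up sample. This is a sketch under stated assumptions: the Id property is hypothetical filler standing in for the "irrelevant XML data items", the real Jobs.xml certainly has a richer layout, and the sample begins with a newline only so that the \nSun pattern can match the first title.

```perl
use strict;
use warnings;

# Made-up sample: two jobs on one line. The Id property is hypothetical filler.
# The leading newline is an assumption so that "\nSun " can match the first title.
my $text = "\n"
    . "<Job><Property><Name>Id</Name><Value>7</Value></Property>"
    . "<Property><Name>Title</Name><Value>Sun 0035 Cold Case</Value></Property></Job>"
    . "<Job><Property><Name>Title</Name><Value>0416 2000 CSI</Value></Property>"
    . "<Property><Name>Id</Name><Value>8</Value></Property></Job>";

# Keep only the captured title ($1) from each <Job>...</Job> section.
$text =~ s#<Job>.*?<Property><Name>Title</Name><Value>(.*?)</Value></Property>.*?</Job>#$1\n#gi;

# Prefix the day name with its day-of-week number so it sorts correctly.
$text =~ s/\nSun /\n1-Sun /gi;

print $text;
```

In the first job the leading .*? has to skip the filler property before the title, and in the second job the trailing .*? skips it after the title, so both non-greedy skips get exercised.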
The output line is easier to understand if I break it apart into equivalent multiple lines of code.

Split apart $AllText, breaking it at each occurrence of a newline (which we added earlier). The newlines themselves are discarded. The result is a list (array). Save the list to the new array variable @temp:

    my @temp = split(/\n/, $AllText);

Sort @temp and save the result (a new separate list) back into @temp:

    @temp = sort(@temp);

Concatenate @temp's array elements, inserting newlines between them, back into one long string and save the result in $AllText:

    $AllText = join("\n", @temp);

The print statement prints to STDOUT by default, and its arguments are a comma-separated list of the items to print, so this means print a newline, then $AllText (which contains its own newlines that we added), then another newline:

    print(STDOUT "\n", $AllText, "\n");

All done in one line of code with no temporary variables needed:

print("\n", join("\n", sort(split(/\n/, $AllText))), "\n");
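Here is the split, sort, and join sequence run on a short made-up list, as a sketch only. "Example Show" is an invented title; the others follow the naming convention described earlier.

```perl
use strict;
use warnings;

# Made-up titles, one per line, in unsorted order.
my $titles = "1-Sun 0035 Cold Case\n0416 2000 CSI\n5-Thu 2100 Example Show\n0416 1500 Typeface\n";

my @list   = split(/\n/, $titles);   # one title per element, newlines discarded
my @sorted = sort(@list);            # default sort compares as strings
my $joined = join("\n", @sorted);    # newlines go between elements only

print $joined, "\n";
```

Because the default sort compares strings character by character, the dated one-time titles (starting with 0) come before the numbered day names (starting with 1- through 7-), and the numeric prefix puts Sunday ahead of Thursday.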
This script is a console program that runs in a Windows Command Prompt (aka DOS Box). If you launch a console program from a desktop icon or any other Windows GUI location, the console window opens, the program runs, and the window immediately closes again, giving you no opportunity to see the result. We need a way to hold the window open. This line tells you what to do...

    print("\nPress ENTER to close...");

This code reads a line (possibly containing no text), terminated by <ENTER>, from the console. Thus, the DOS window is held open waiting for you to enter the line. The code would arguably be clearer if written as my $a = <STDIN>;, but the variable would never be used, so the assignment isn't needed:

    <STDIN>;

Exit the program normally:

    exit(0);

Installation for use

I run this program by double-clicking on a desktop icon, which I created like this (Windows XP):
That's "all"! Seriously, it takes much longer to describe than to do.

BTVSchedule.bat - Windows/MSDOS batch file using Super-sed

This was the first version of the script, written while I was learning to use Super-sed. It was a few weeks later that this led me toward Perl, which does everything sed/ssed can do and more.

echo off

Walkthrough of this batch file and ssed program

Perl and ssed are closely related, ssed being a utility program that specializes in one of Perl's capabilities, and they both use the same Perl compatible regular expressions ("PCRE"), so it's no surprise that when the batch file is broken apart into separate lines for clarity, this turns out to be the same program with two minor differences:
There are two versions of this script. The first version listed below invokes ssed to read and perform one transformation on the input text. It then "pipes" (passes) the result text, using the pipe character |, to another invocation of ssed to do the next transformation, and so on. When all the sseds are finished, the result text is piped to the MSDOS sort command for sorting and output to the console. pause is a DOS command that prints a message and holds the console window open until you press a key.
echo off
ssed -Re "{s#</?JobCollection>##gi}" "C:\Documents and Settings\All Users\Application Data\SnapStream\Beyond TV\Jobs.xml" |
ssed -Re "{s#<Job>.*?<Property><Name>Title</Name><Value>(.*?)</Value></Property>.*?</Job>#\1\n#gi}" |
ssed -Re "{s/^Sun /1-Sun /i}" |
ssed -Re "{s/^Mon /2-Mon /i}" |
ssed -Re "{s/^Tue /3-Tue /i}" |
ssed -Re "{s/^Wed /4-Wed /i}" |
ssed -Re "{s/^Thu /5-Thu /i}" |
ssed -Re "{s/^Fri /6-Fri /i}" |
ssed -Re "{s/^Sat /7-Sat /i}" |
sort
echo ----
pause
This second version of the script, broken apart into separate lines and reformatted for clarity, is slightly more efficient. It invokes ssed only once, with multiple s/// commands between the enclosing {} braces, and it avoids the MSDOS overhead of piping the output from one ssed call to the next.

However, it has a serious disadvantage: this version only works properly because Jobs.xml happens to be one long line with no line breaks in it. As a result, the single invocation of ssed slurps the entire file as one line and can perform all its processing on it. If the file had line breaks in it, ssed would 1) read the first line, 2) perform all the actions on it, 3) repeat for each line. This would not have the desired result because the line breaks could occur between any set of XML tags and the regular expression matches might not succeed where they should. Therefore, the first version of the script above, being more fault tolerant, is better. Nonetheless, this example shows that ssed can execute multi-statement programs, which in other contexts is very useful.

echo off
ssed -Re "
{
s#</?JobCollection>##gi;
s#<Job>.*?<Property><Name>Title</Name><Value>(.*?)</Value></Property>.*?</Job>#\1\n#gi;
s/^Sun /1-Sun /i;
s/^Mon /2-Mon /i;
s/^Tue /3-Tue /i;
s/^Wed /4-Wed /i;
s/^Thu /5-Thu /i;
s/^Fri /6-Fri /i;
s/^Sat /7-Sat /i;
}" "C:\Documents and Settings\All Users\Application Data\SnapStream\Beyond TV\Jobs.xml" |
sort
echo ----
pause
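The line-break problem described above can be illustrated in Perl, as a rough sketch rather than a statement about how ssed behaves internally. The sample is made up, with a single job deliberately split across two lines, and the pattern is a shortened stand-in for the one used in the scripts. The s modifier is added here because, in Perl, . does not match a newline unless it is given.

```perl
use strict;
use warnings;

# Made-up sample: one job whose XML is split across two lines.
my $xml = "<Job><Property><Name>Title</Name>\n"
        . "<Value>0416 2000 CSI</Value></Property></Job>\n";

# Shortened stand-in pattern; the s modifier lets "." match a newline too.
my $pattern = qr#<Job>.*?<Value>(.*?)</Value>.*?</Job>#s;

# One line at a time: no single line contains a complete <Job>...</Job>.
my @per_line = grep { $_ =~ $pattern } split(/\n/, $xml);

# Whole text at once: the match can span the line break.
my ($title) = $xml =~ $pattern;

print "lines matched one at a time: ", scalar(@per_line), "\n";
print "title from the whole text: $title\n";
```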
Notes
Copyright ©2012 Steven Whitney. Last modified Sun 07/29/2012 12:20:50 -0700.