25 Years of Programming
An open source source for C, C++, OWL, BASIC, MDB, XLS, DOT, and more...
Home   Projects   Up   Sitemap   Search   Blog   Forum+Chat   About Us   Privacy   Terms of Use   Feedback   FAQ   Images   Services   Payments   Humor   Music  

MySQL database import scripts for the Moby Word Lists, Parts of Speech, and Thesaurus

The MobyWordListsMySQL project consists of a Perl script and an SQL script that work together to import Grady Ward's Moby Project Word Lists, Parts of Speech, and Thesaurus into MySQL database tables.

The scripts are tested in Linux and Windows, and should also work on other operating systems, possibly with minor modifications.

The documentation about the problems I encountered and how they were resolved is intentionally extremely detailed so that it might serve as a guide to solving similar import problems in other projects, even ones having nothing to do with Moby.  

Requirements:

  • MySQL. If you're reading this, you probably already have it, but if not, I've written an introduction about using it.
  • MySQL permissions (at least temporary ones), on both the MySQL Server side and the MySQL Command Line Client side, to use the LOAD DATA LOCAL INFILE statement, discussed below, with more detailed explanation in the MySQL Manual. If you don't have permission for this by default on the MySQL Server side, you will need the ability to log into MySQL as the MySQL 'root' user to change the setting. 
  • The Moby Project zip files from Project Gutenberg. They contain the word lists as plain text data files with MSDOS-style carriage-return+line-feed (CRLF) line ends. These are the source data files for the scripts below. When I first published this project, all the files were in one zip download at Project Gutenberg, but they have since been split into separate downloads. You'll need these files:
    Moby Part of Speech List by Grady Ward
    Moby Thesaurus List by Grady Ward
    Moby Word Lists by Grady Ward
  • Perl. In Linux, you already have it or can install it from your distribution's repository. I developed and tested the PrepMoby.pl script in Ubuntu with Perl 5.10.0. In Windows, you can install the free ActivePerl. I tested the script with version 5.10.1.
  • About 150MB(?) of free disk space for the MySQL tables and indexes. This is a large database, with nearly 4 million total rows.
  • The mobypos-ascii-additions.txt file (description and link below).

Most of the imports are simple ones accomplished by the .sql script below, but two of the data files require preprocessing, which is done by the Perl script:

1) The Moby POS (Parts of Speech) file:

Most of the word list files are in pure 7-bit ASCII character encoding, which means they contain only the ASCII characters 0-127 (0x00-0x7F hex), with no accented letters or symbols from languages other than English. mobypos.txt is different. It has 1,231 words with characters that are not in the standard ASCII set (i.e. they are above ASCII 127). When these words are imported into the Parts of Speech table as-is, they become difficult to search for if your keyboard doesn't have keys for those characters. Also, MySQL won't consider them to match words in the other word lists where they don't have the accented characters. For example, a search for "facade" will fail if it is in the database as "façade".

There is a second problem, the character encoding. Although the file has MSDOS-style CRLF line ends like the other files, its character encoding is not DOS or Windows. It is Macintosh MacOS MacRoman, which has different character mappings from MySQL latin1, so the non-ASCII characters display as the completely wrong characters. 

The PrepMoby.pl script does half the needed conversion. It converts from MacRoman to Windows 1252 (which is the same as latin1), so at least the accented characters display as the correct characters. 

When you use the MySQL Command Line Client in Windows, your DOS window is actually using the old DOS "code page" #437, and the non-ASCII characters still won't display properly. Before launching mysql, type this DOS command, which will switch to the correct Windows 1252 character set: CHCP 1252.  

To solve the problem of searching/matching, I decided an additional workaround would be appropriate: put the affected 1231 words in the database in both accented and unaccented forms to allow searching for either. That requires making a second import-ready file containing the unaccented versions. The Linux iconv utility can do this type of "transliterating" conversion, substituting "é" with "e", etc. The code for doing that is in the comments of the CreateMobyDatabase.sql script. However, I know no way to do it in Windows, so I created a file that already contains the converted 1231 words. 

If you use Windows, you will need to download the file, since the SQL script assumes you have it. If you use Linux, you can create the file yourself, or download it to save time:

Right-click this link mobypos-ascii-additions.txt (15 KB) and then select Save As... to save the file directly to disk. It has Linux-style LF line ends which must be preserved because that's what the SQL script expects in both Linux and Windows. If you save the file any other way in Windows, the line ends will be changed to CRLF, and the SQL script will report errors during import.

2) The Moby Thesaurus file, mthesaur.txt, has data format errors and in addition has a structure that doesn't make much sense for use as a database table. The Perl script corrects the errors and transforms the structure. Exactly what it does is described in the script comments.

The following paragraphs about the thesaurus preprocessing mention some tools that are useful in a project like this: 

My first attempt started well in Ubuntu, using ssed (an enhanced version of Linux sed, with a Windows version available) and the Linux sort command. However, one of the project goals was that the same methods should work in both Linux and Windows XP, and the first attempt hit a dead end because the Windows sort program is case insensitive, with no good workaround. 

Later, I did find an alternative in Windows. The Notepad++ text editor has a plugin called TextFX. The sort-and-cull-to-uniques step (noted below in the code) of 2.5 million text lines (26MB), which took 20 minutes in Ubuntu Perl and 5 hours in Windows Perl (before crashing) was accomplished by Notepad++ TextFX in an astonishing 5 seconds!

During the first attempt, I also discovered that the file's needed structural transformation (shown in the code below) didn't seem possible with ssed. So I started learning Perl, enough to do that one step. When that was done, the project "only" required: ssed, Linux sort, and Perl, which was a bit much to expect anyone to run around and gather up, and it still didn't work in Windows because it couldn't do a case-sensitive sort. So that led to learning enough Perl to do the whole procedure with it. 

Instructions

  1. Obtain Perl if you don't already have it.
  2. Download and unzip the Moby Word List files from Project Gutenberg. Put all the unzipped text files into a directory containing only those files.
  3. Copy and paste the text of the two scripts below, to create text files in that same directory, using the names shown in the headings.
  4. Open a Terminal (Linux) or a Command Prompt (Windows).
  5. Use the cd command (in both Linux and Windows) to change directory to the one where you put the files.
  6. Run the Perl script with this line. It will take a couple or a few minutes:
    perl -wT PrepMoby.pl
  7. Launch the MySQL Command Line Client with:
    mysql --local-infile -uroot -p
    (enter your password when prompted)
    The --local-infile option gives you permission on the Client side to use the LOAD DATA LOCAL INFILE statement.
  8. Ensure you have Server side permissions to use LOAD DATA LOCAL INFILE:
    In the Client, enter this query:
    show variables like 'local_infile';
    If the result looks like this, with local_infile = OFF,
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | local_infile  | OFF   |
    +---------------+-------+

    you'll need to change the setting with this statement:
    set global local_infile=ON;
    Expected result:
    mysql> show variables like 'local_infile';
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | local_infile  | ON    |
    +---------------+-------+

     
  9. Run the SQL script with this line. It can take 30 minutes or more to complete.
    source CreateMobyDatabase.sql;
    Each DROP TABLE command in the script will issue a harmless error that the table doesn't exist. Watch the printout for any other errors or warnings. There should be none.
  10. If you needed to change server-side permissions to allow LOAD DATA LOCAL INFILE, restore the setting to its previous value with:
    set global local_infile=OFF;

This project has no Microsoft Access version. If you want to work with the tables in Access, I'd suggest creating the database and tables in MySQL, then creating another Moby database in Access. In Access Moby, use ODBC to create links to the MySQL tables. This gives you the speed and large table capability of MySQL with the feature-rich Access interface. The alternative is to import the text files into Access directly, but some of these tables are so large that I have doubts whether Access can handle them. I haven't tried it. 

I occasionally revise the scripts below. The "version number" is the date at the top of each file.


All-in-one version (everything needed provided)

If you just want to get the database set up as quickly and easily as possible and don't want to have to study or modify source code, this alternative download is faster and easier. It does not require Perl. The errors and character encoding issues are already taken care of, and it has the additional 1231 lines in the Parts of Speech tables. The package was built from an already fully constructed moby database, using an output script for backing up the database, which is also provided in the package. The download size is about 17MB. The .zip expands to about 64MB of files. 

Please request by email. Read FAQ first. US$12.00

All that's required is that you install it in MySQL. These are the instructions:

  1. Copy the distribution .zip file into a disk folder where it is the only file.
  2. Unzip it.
  3. This isn't required, but changing the line ends makes the two .sql files display properly in Notepad:
    In Windows only, use WordPad to open these two files:
    MobyCreateDatabase.sql
    MobySelectIntoOutfiles.sql
    Without making any changes to the files, Save each (Ctrl+S). This converts the line ends from Linux LF to Windows CRLF. Do not edit any of the -tsv.txt files this way (or at least don't Save them). For the import scripts to work properly, these files must have their LF line ends.
  4. Make sure your MySQL server is running.
  5. Open a Terminal (Linux) or Command Prompt (Windows).
  6. Navigate to the folder where the .zip contents were extracted to (that is, where the plain text files are). Use the cd command in both Linux and Windows:
    Linux     : cd dir1/dir2/
    Windows: cd dir1\dir2\
  7. Launch the MySQL Command Line Client:
    mysql --local-infile -uroot -p
    Enter your MySQL password when prompted.
  8. Use this MySQL command to make sure you do not already have a database called 'moby':
    SHOW DATABASES;
    If a database called moby already exists, the installation script will destroy it. You will need to either rename the existing database or modify this project's installation script to use a database name other than 'moby'.
  9. Run the MobyCreateDatabase.sql installation script with this MySQL command:
    SOURCE MobyCreateDatabase.sql;
    On my PC, the installation takes about 10 minutes in Ubuntu and 30 minutes in Windows. One of the tables has 2.5 million rows.
    When it's finished, a list of the tables is displayed, and the database is ready for use.

Troubleshooting

  • If the MobyCreateDatabase.sql script generates this error:
    [ERROR in query 24] The used command is not allowed with this MySQL version,
    it actually has nothing to do with the MySQL version. It is because you apparently do not have permissions on the server side of MySQL to use the LOAD DATA LOCAL INFILE command. To resolve the problem, please see the instructions above for the free version of the script.  

PrepMoby.pl

#!/usr/bin/perl -wT
#
# PrepMoby.pl	2011-10-19
# This file is part of the MobyWordListsMySQL project.
# Copyright (C)2010, 2011 Steven Whitney.
# Initially published by http://25yearsofprogramming.com.
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# Version 3 as published by the Free Software Foundation.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
#
# NOTES:
# This script uses Perl to prepare for importing two of the Moby Word Lists text files into the Moby MySQL database.
# It creates modified copies of the files without changing the original files.
# 1) It translates the mobypos.txt file from Macintosh MacRoman charset to MySQL latin1 (Windows-1252).
# 2) It fixes problems with the original Moby mthesaur.txt file.
#
# Much of these procedures can be done more efficiently with ssed and Linux shell commands, but one goal of this
# project was that it should be cross-platform, and Windows counterparts (especially sort) don't have 
# equivalent functionality.
# 
# CHANGELOG:
# 2011-10-19: 
#   Made the code in step 2b) fast enough to use, re-enabled it, and removed the workaround. 
#   Changed method of removing line ends and whitespace.

use strict;
use warnings;

use Encode;

# --------------------------------------------------------------------------------
# MOBY PARTS OF SPEECH (MOBYPOS) PREPARATION
# The original mobypos.txt is in MacOS MacRoman character encoding (although it has CRLF line ends).
# MySQL cannot directly import some of the characters into its latin1 character set. 
# This script translates the file to cp1252 (Windows 1252, which is the same as MySQL latin1).
# It seems that it cannot translate it all the way to ASCII. I had to do that manually with the linux iconv utility.
# In the Perl "Encode" class, the "substitution" character for illegal chars is always "?", not a char chosen to be similar. 
# So iconv is the only conversion method to ASCII I've found, and it's Linux-only, not Windows. 

print(STDERR "Transcoding mobypos.txt (MacRoman) to mobypos-cp1252.txt (Windows-1252 aka latin1)...\n");

my $infile = "mobypos.txt";
my $outfile = "mobypos-cp1252.txt";
open(INFILE, "<", $infile) or die("Can't open $infile: $!");
open(OUTFILE, ">", $outfile) or die("Can't open $outfile: $!");
# binmode prevents \n from being converted on output to \r\n in Windows
# it is necessary because the line terminator must be explicitly declared in MySQL LOAD DATA LOCAL INFILE;
# it doesn't adjust itself to the operating system, so we must ensure that our output here is the same for Win/Lin.
binmode(OUTFILE);	

my $line;
while($line = <INFILE>)				# reads 1 line at a time
{
	# REMOVE LEADING, TRAILING WHITESPACE, NEWLINES.
	$line =~ s/^[[:space:]]*(.*?)[[:space:]]*$/$1/;
	if(length($line) > 0)			# ensure whatever is left is not a blank line
	{
		Encode::from_to($line, "MacRoman", "cp1252", Encode::FB_DEFAULT);
		# This just turns non-ASCII chars into question marks. 
		# It doesn't choose a similar replacement char like iconv does. Unusable.
		# Encode::from_to($line, "cp1252", "ascii", Encode::FB_WARN);	
		print(OUTFILE "$line\n");	# put back a consistent \n in either Lin/Win.
	}
}
close(INFILE);
close(OUTFILE);


# --------------------------------------------------------------------------------
# MOBY THESAURUS PREPARATION.
# 1) MAKE A CLEAN COPY OF MTHESAUR.TXT, FIXING LINES WITH MULTIPLE CONSECUTIVE COMMAS AND WITH COMMA(S) AT LINE ENDS.
print(STDERR "MAKING A CLEAN COPY OF MTHESAUR.TXT...\n");

$infile = "mthesaur.txt";
$outfile = "mthesaur-clean.txt";
open(INFILE, "<", $infile) or die("Can't open $infile: $!");
open(OUTFILE, ">", $outfile) or die("Can't open $outfile: $!");
binmode(OUTFILE);	

while($line = <INFILE>)
{
	$line =~ s/^[[:space:]]*(.*?)[[:space:]]*$/$1/;
	if(length($line) > 0)			# ensure whatever is left is not a blank line
	{
		$line =~ s/,,+/,/gi;		# condense multiple commas to one
		$line =~ s/,$//;			# remove trailing commas
		print(OUTFILE "$line\n");	# put back a consistent \n.
	}	
}
close(INFILE);
close(OUTFILE);

# 2) CREATE FILE FOR MAKING A SIMPLE PHRASES LIST. REQUIRES TWO STEPS, USING A TEMP FILE.
# 2a) PUT EACH WORD OR PHRASE ON ITS OWN LINE.
print(STDERR "CREATING TEMP FILE TO MAKE A UNIQUE PHRASES LIST...\n");

$infile = "mthesaur-clean.txt";
$outfile = "mthesaur-phrases-unsorted.txt";	# source file to later sort+uniqueify, which you can do manually if desired.
open(INFILE, "<", $infile) or die("Can't open $infile: $!");
open(OUTFILE, ">", $outfile) or die("Can't open $outfile: $!");
binmode(OUTFILE);

while($line = <INFILE>)
{
	$line =~ s/^[[:space:]]*(.*?)[[:space:]]*$/$1/;
	if(length($line) > 0)			# ensure whatever is left is not a blank line
	{
		if($line =~ /^,/)			# paranoid check; should never happen
		{
			close(INFILE);
			close(OUTFILE);
			unlink("mthesaur-phrases-unsorted.txt");		
			die("mthesaur-clean.txt is corrupt. A line starts with a comma.\n");
		}
		$line =~ s/,/\n/g;			# change all commas to line ends
		print(OUTFILE "$line\n");	# write it back, adding the missing \n at original real end of line
	}	
}
close(INFILE);
close(OUTFILE);

# 2b) NOW SORT IT AND CULL TO A LIST OF UNIQUES. 

print(STDERR "SORTING AND CULLING PHRASES LIST TO UNIQUE ENTRIES ONLY...\n");

$infile = "mthesaur-phrases-unsorted.txt";
$outfile = "mthesaur-phrases-uniq.txt";	
open(INFILE, "<", $infile) or die("Can't open $infile: $!");
open(OUTFILE, ">", $outfile) or die("Can't open $outfile: $!");
binmode(OUTFILE);

my %phrasearray = ();
while(<INFILE>) 
{
	s/^[[:space:]]*(.*?)[[:space:]]*$/$1/;
	$phrasearray{$_} = 1 if(length);
}
foreach(sort keys %phrasearray) 
{
	print(OUTFILE "$_\n");
}
close(INFILE);
close(OUTFILE);


# 3) CREATE THE SOURCE FILE FOR THE MYSQL THESAURUS.
# Converts the cleaned copy of the Moby mthesaur.txt file
# FROM
#	root,related1,related2...\n
# TO
#	root,root\n
#	root,related1\n
#	root,related2\n
#	...
print(STDERR "CREATING THE SOURCE FILE FOR THE MYSQL THESAURUS...\n");

$infile = "mthesaur-clean.txt";
$outfile = "mthesaur-split.txt";
open(INFILE, "<", $infile) or die("Can't open $infile: $!");
open(OUTFILE, ">", $outfile) or die("Can't open $outfile: $!");
binmode(OUTFILE);

my $rootword;
while($line = <INFILE>)							# reads 1 line at a time into the scalar (non-array) variable
{
	$line =~ s/^[[:space:]]*(.*?)[[:space:]]*$/$1/;
	if(length($line) > 0)						# ensure whatever is left is not a blank line
	{
		$line =~ /^([^,]*).*/;					# just parses for regex, takes no action
		$rootword = $1;							# the word or phrase before the first comma is the root
		$line =~ s/([^,]+),?/$rootword,$1\n/gi;	# insert root word and comma before every word; separate entries with newlines
		print(OUTFILE "$line");					# the newlines are already embedded	
	}
}
close(INFILE);
close(OUTFILE);

print(STDERR "DONE.\n");
print(STDERR "The next step is to run the CreateMobyDatabase.sql script in MySQL.\n");

CreateMobyDatabase.sql

/*	CreateMobyDatabase.sql	2011-10-19
	Creates and populates all the tables in the MOBY MySQL database.
	This file is part of the MobyWordListsMySQL project.
	Copyright (C)2010, 2011 Steven Whitney.
	Initially published by http://25yearsofprogramming.com.

	This program is free software; you can redistribute it and/or
	modify it under the terms of the GNU General Public License
	Version 3 as published by the Free Software Foundation.
	This program is distributed in the hope that it will be useful,
	but WITHOUT ANY WARRANTY; without even the implied warranty of
	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
	GNU General Public License for more details.
	You should have received a copy of the GNU General Public License
	along with this program; if not, write to the Free Software
	Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.

NOTES:
--Adding indexes for pos and thesaurus made a huge speed improvement (100X?) for some queries. 
--The script does not create tables for these files: USACONST.TXT (US Constitution), roget13a.txt (sample thesaurus formatting).
--Dropping a table and recreating it is a way to reset an auto_increment field to start at 1. 

CHANGELOG:
2011-10-19:
  Simplified the code for loading mobythesauruswordlist because Perl can now create its data source quickly.

*/
-- Automatically display all errors and warnings.
WARNINGS;

DROP DATABASE IF EXISTS `moby`;
-- The default collation is case-insensitive, which is probably better for most purposes.
CREATE DATABASE IF NOT EXISTS `moby`
	  CHARACTER SET latin1 COLLATE latin1_swedish_ci;
USE `moby`;

-- Most of the tables are imported directly from the Moby distribution text files, which have line ends of \r\n.

DROP TABLE IF EXISTS `crosswordlegal`;
CREATE TABLE `crosswordlegal`
(
	`alphaid` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`alpha` VARCHAR(255) NOT NULL,
	PRIMARY KEY (`alphaid`)
) ENGINE = INNODB;
-- LOAD FROM SCRABBLE DICTIONARY VERSION 1 FILE
LOAD DATA LOCAL INFILE 'CROSSWD.TXT' 
INTO TABLE `crosswordlegal` 
LINES TERMINATED BY '\r\n'
(alpha) 
;
-- AND FROM SCRABBLE DICTIONARY VERSION 2 FILE
LOAD DATA LOCAL INFILE 'CRSWD-D.TXT' 
INTO TABLE `crosswordlegal` 
LINES TERMINATED BY '\r\n'
(alpha) 
;

DROP TABLE IF EXISTS `acronyms`;
CREATE TABLE `acronyms` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'ACRONYMS.TXT' 
INTO TABLE `acronyms` 
LINES TERMINATED BY '\r\n'
(alpha) 
;

DROP TABLE IF EXISTS `commondictionarywords`;
CREATE TABLE `commondictionarywords` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'COMMON.TXT' 
INTO TABLE `commondictionarywords` 
LINES TERMINATED BY '\r\n'
(alpha) 
;

DROP TABLE IF EXISTS `compoundwords`;
CREATE TABLE `compoundwords` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'COMPOUND.TXT' 
INTO TABLE `compoundwords` 
LINES TERMINATED BY '\r\n'
(alpha) 
;

DROP TABLE IF EXISTS `wordfreqenglish`;
CREATE TABLE `wordfreqenglish` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'FREQ.TXT' 
INTO TABLE `wordfreqenglish` 
LINES TERMINATED BY '\r\n' 
IGNORE 2 LINES 
(alpha) 
;

DROP TABLE IF EXISTS `namesall`;
CREATE TABLE `namesall` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'NAMES.TXT' 
INTO TABLE `namesall` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `namesf`;
CREATE TABLE `namesf` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'NAMES-F.TXT' 
INTO TABLE `namesf` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `namesm`;
CREATE TABLE `namesm` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'NAMES-M.TXT' 
INTO TABLE `namesm` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `oftenmisspelled`;
CREATE TABLE `oftenmisspelled` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'OFTENMIS.TXT' 
INTO TABLE `oftenmisspelled` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `places`;
CREATE TABLE `places` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'PLACES.TXT' 
INTO TABLE `places` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `singlewords`;
CREATE TABLE `singlewords` LIKE `crosswordlegal`;
LOAD DATA LOCAL INFILE 'SINGLE.TXT' 
INTO TABLE `singlewords` 
LINES TERMINATED BY '\r\n' 
(alpha) 
;

DROP TABLE IF EXISTS `wordfreqenglishinternet`;
CREATE TABLE `wordfreqenglishinternet`
(
	`alphaid` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`alpha` VARCHAR(255) NOT NULL,
	`freqpct` DOUBLE PRECISION NOT NULL,
	PRIMARY KEY (`alphaid`)
) ENGINE = INNODB;
LOAD DATA LOCAL INFILE 'FREQ-INT.TXT' 
INTO TABLE `wordfreqenglishinternet` 
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\r\n'
IGNORE 2 LINES 
(freqpct,alpha) 
;

DROP TABLE IF EXISTS `amytansubstringfrequencies`;
CREATE TABLE `amytansubstringfrequencies`
(
	-- id not alphaid to emphasize that these are substrings unrelated to words in the other lists
	`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`freq` INT UNSIGNED NOT NULL,
	`alpha` VARCHAR(255) NOT NULL,
	PRIMARY KEY (`id`)
) ENGINE = INNODB;
LOAD DATA LOCAL INFILE 'FICTION.TXT' 
INTO TABLE `amytansubstringfrequencies` 
FIELDS TERMINATED BY '<' 
LINES TERMINATED BY '>\r\n'
(freq,alpha) 
;

DROP TABLE IF EXISTS `kingjamessubstringfrequencies`;
CREATE TABLE `kingjamessubstringfrequencies` LIKE `amytansubstringfrequencies`;
LOAD DATA LOCAL INFILE 'KJVFREQ.TXT' 
INTO TABLE `kingjamessubstringfrequencies` 
FIELDS TERMINATED BY '<' 
LINES TERMINATED BY '>\r\n'
(freq,alpha) 
;

-- A Moby Parts of Speech table in the format of the original text file.
DROP TABLE IF EXISTS `mobypos`;
CREATE TABLE `mobypos`
(
	`alphaid` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`alpha` VARCHAR(255) NOT NULL,					-- case can be important, but probably best not to use binary collation as default.
	`pos` VARCHAR(20) NOT NULL COLLATE latin1_bin,	-- the pos field chars are case sensitive. You must use REGEXP BINARY.
	PRIMARY KEY (`alphaid`),
	INDEX (`alpha`)
) ENGINE = INNODB;
LOAD DATA LOCAL INFILE 'mobypos-cp1252.txt'			-- in this file, European words have their accented characters 
INTO TABLE `mobypos` 
FIELDS TERMINATED BY '\\' 
	ESCAPED BY ''			-- required to make the field terminator above work properly
LINES TERMINATED BY '\n'
(alpha,pos) 
;

/* 
(OPTIONAL. IF YOU DO NOT WANT TO DO THIS, COMMENT OUT THE CODE...)
ADD THE 7-BIT ASCII TRANSLITERATIONS OF ENTRIES WITH NON-ENGLISH CHARACTERS. 
FOR EXAMPLE, ICONV REPLACES FRENCH ACCENTED "E" CHARACTERS WITH A PLAIN "E".  
IF THIS FILE IS IMPORTED INTO THE TABLE, EACH OF THOSE ENTRIES WILL HAVE TWO ROWS, ONE WITH ACCENTS, 
ONE WITHOUT, WHICH ALLOWS DATABASE SEARCHES ON EITHER ONE. 
THE INPUT FILE HAS 1231 LINES. IT WAS CREATED IN LINUX WITH: 
iconv -f CP1252 -t ASCII//TRANSLIT mobypos-cp1252.txt > mobypos.ascii;
diff mobypos-cp1252.txt mobypos.ascii | grep -iP '^\> ' | ssed -Re '{s/\> (.*)/\1/}' > mobypos-ascii-additions.txt;
*/
LOAD DATA LOCAL INFILE 'mobypos-ascii-additions.txt' 
INTO TABLE `mobypos` 
FIELDS TERMINATED BY '\\' 
	ESCAPED BY ''			-- required to make the field terminator above work properly
LINES TERMINATED BY '\n'
(alpha,pos) 
;

-- A transformed version of Moby Parts of Speech. 
-- Instead of 1 entry per word, with a list of parts of speech per word (word\NV),
-- it has multiple entries for multi-pos words, with only 1 pos per entry (word\N, word\V).
-- In the source file, POS are listed in decreasing order of usage frequency. freqrank preserves that important ranking.
DROP TABLE IF EXISTS `mobypos1per`;
CREATE TABLE `mobypos1per`
(
	`alphaid` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`alpha` VARCHAR(255) NOT NULL,
	`pos` CHAR(1) NOT NULL COLLATE latin1_bin,		-- the pos field chars are case sensitive. You must use REGEXP BINARY.
	`freqrank` TINYINT UNSIGNED NOT NULL DEFAULT 0,	-- preserves significance of the pos list order from mobypos
	PRIMARY KEY (`alphaid`),
	INDEX (`alpha`),
	INDEX (`pos`)
) ENGINE = INNODB;
-- build it by extracting from mobypos
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,1,1), 1 FROM mobypos WHERE LENGTH(pos) >= 1;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,2,1), 2 FROM mobypos WHERE LENGTH(pos) >= 2;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,3,1), 3 FROM mobypos WHERE LENGTH(pos) >= 3;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,4,1), 4 FROM mobypos WHERE LENGTH(pos) >= 4;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,5,1), 5 FROM mobypos WHERE LENGTH(pos) >= 5;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,6,1), 6 FROM mobypos WHERE LENGTH(pos) >= 6;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,7,1), 7 FROM mobypos WHERE LENGTH(pos) >= 7;
INSERT INTO mobypos1per (alpha, pos, freqrank) SELECT alpha, MID(pos,8,1), 8 FROM mobypos WHERE LENGTH(pos) >= 8;
-- accuracy check: total number of parts of speech should equal total number of new records, 1 part of speech each
SELECT SUM(LENGTH(`pos`)) FROM mobypos;  
SELECT COUNT(*) FROM mobypos1per; 

-- THESAURUS SECTION.

DROP TABLE IF EXISTS `mobythesauruswordlist`;
CREATE TABLE `mobythesauruswordlist` LIKE `crosswordlegal`;

-- THIS LOADS THE TABLE mobythesauruswordlist.
LOAD DATA LOCAL INFILE 'mthesaur-phrases-uniq.txt' 
INTO TABLE `mobythesauruswordlist` 
LINES TERMINATED BY '\n' 
(alpha);


-- Create the main thesaurus table. The Perl script created the input file for this, 
-- changing the thesaurus format
-- FROM
--	root,related1,related2...\n
-- TO
--	root,root\n
--	root,related1\n
--	root,related2\n
DROP TABLE IF EXISTS `mobythesaurus`;
CREATE TABLE `mobythesaurus`
(
	`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
	`root` VARCHAR(255) NOT NULL,
	`related` VARCHAR(255) NOT NULL,
	PRIMARY KEY (`id`),
	INDEX (`root`),
	INDEX (`related`)
) ENGINE = INNODB;
LOAD DATA LOCAL INFILE 'mthesaur-split.txt' 
INTO TABLE `mobythesaurus` 
FIELDS TERMINATED BY ',' 
LINES TERMINATED BY '\n'
(root,related) 
;

SHOW TABLES;

 

 

Valid HTML 4.01 Transitional
Yahoo! Search
Search the web Search this site
Valid CSS