The Perishable Press method looks good, and their format is neat and potentially easier to understand (depending on the groupings and format you prefer). There are usually multiple ways to do anything in .htaccess. Using a format that's easy for you to read makes the code easier to revise later, if needed. In addition to their completed 4G Blacklist at
http://perishablepress.com/press/2009/03/16/the-perishable-press-4g-blacklist/, there is also an explanatory walkthrough of its methods at
http://perishablepress.com/press/2009/02/03/eight-ways-to-blacklist-with-apaches-mod_rewrite/, and there are other related articles on the site as well. The author also posted notes about the customization needed to use it with Joomla and WordPress. Anyone trying to put together an .htaccess file will find those articles useful for improving their understanding. The light text on a dark background made my eyes go buggy quickly, though, and I stopped reading before I really wanted to.
-----
Here are some notes about regular expressions.
^ and $, if used, "anchor" (fix the location of, or "must match") the start and end of the target string.
A regex of:
dog will return true if "dog" is anywhere within the target string.
^dog returns true only if the target string starts with "dog".
dog$ returns true only if the target string ends with "dog".
^dog$ returns true only if the target string is exactly "dog".
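If you want to try the four anchor cases yourself, here is a minimal sketch in Python. It assumes Python's re.search behaves like Apache's matching for these simple patterns (both look for the pattern anywhere in the string unless an anchor says otherwise). The helper name and the sample strings are made up for illustration.

```python
import re

def matches(pattern, target):
    # re.search looks for the pattern anywhere in the target,
    # so only ^ and $ pin it to the start or end.
    return re.search(pattern, target) is not None

# Hypothetical sample strings, chosen to exercise each anchor case.
targets = ["dog", "hotdog", "dogma", "my dog barks"]

results = {
    pattern: [t for t in targets if matches(pattern, t)]
    for pattern in ["dog", "^dog", "dog$", "^dog$"]
}

for pattern, hits in results.items():
    print(pattern, "matches", hits)
```

Only the unanchored dog matches all four strings; ^dog$ matches just the string that is exactly "dog".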
When two or more terms are inside parentheses with vertical bars between them, the test will return true if any one of them appears at the given location:
(dog|cat|mouse)
A period . (if it does not have a backslash in front of it) will match any single character.
An asterisk * means "zero or more occurrences" of the preceding "term" (piece of a regex).
You often see the two used together like this:
.* which means "any sequence of characters, 0 or more characters long".
If you want to apply the * to a longer sequence, put that sequence inside parentheses so they get grouped together as one term:
(the)* will match 0 or more occurrences of "the", including: (no text at all, 0 occurrences), the (1 occurrence), thethe, thethethe...
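Alternation, the period, the asterisk, and grouping can all be checked the same way. This is a rough sketch in Python, on the assumption that its re module treats these basic constructs the same way Apache's regex engine does. The sample strings are invented, and I added ^ and $ to most patterns so that the whole string has to fit.

```python
import re

def found(pattern, target):
    return re.search(pattern, target) is not None

# Alternation: true if any one of the listed terms appears.
alt = [t for t in ["hotdog", "catalog", "bird"] if found("(dog|cat|mouse)", t)]

# An unescaped period stands for exactly one character of any kind.
dot = [t for t in ["dig", "dog", "d-g", "dg"] if found("^d.g$", t)]

# .* stands for any run of characters, including none at all.
dot_star = [t for t in ["dg", "dog", "d and g", "dot"] if found("^d.*g$", t)]

# Parentheses make "the" one term, so * repeats the whole word.
group_star = [t for t in ["", "the", "thethe", "thet"] if found("^(the)*$", t)]

print(alt, dot, dot_star, group_star)
```

Note that "dg" fails the ^d.g$ test (the period needs one character) but passes ^d.*g$ (zero characters is fine).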
Characters inside brackets, like [ABCabc], will match any one of the enclosed characters. You can also specify a range: [A-Z], which means any one of the capital letters A through Z, and you can specify multiple ranges, like [A-Za-z], which matches any upper or lowercase letter.
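A quick way to convince yourself how bracket expressions behave is to test single characters against them. The sketch below uses Python as a stand-in for Apache's regex engine (an assumption, though these classes work the same in both). The test characters are arbitrary.

```python
import re

def found(pattern, target):
    return re.search(pattern, target) is not None

# A bracket expression matches exactly one character from its set.
one_of = [c for c in ["A", "b", "C", "d", "Z"] if found("^[ABCabc]$", c)]

# A range covers every character between its two endpoints.
upper = [c for c in ["Q", "q", "7"] if found("^[A-Z]$", c)]

# Several ranges can sit side by side inside one pair of brackets.
letters = [c for c in ["Q", "q", "7", "_"] if found("^[A-Za-z]$", c)]

print(one_of, upper, letters)
```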
Let's look at an example line from above, with each part listed next to its description:
RewriteCond %{QUERY_STRING} ^.*\.[A-Za-z0-9].* [NC,OR]
This test checks the request's query string to see if it contains:
^.* "Starts with" any sequence of characters
followed by
\. a literal period (the leading backslash is required when you want a period to mean "a period" instead of "any character")
followed by
[A-Za-z0-9] any upper or lowercase letter or a digit
followed by
.* any sequence of characters
all tested in a case-insensitive manner ([NC]).
Although this code does its job, some nitpicking will allow some additional explanation.
1. The sequence ^.* is unnecessary and can be eliminated. Remember that a regex of dog matches "dog" anywhere in the target string, since it is not anchored with ^ or $ to either the start or end of the string, nor is it fixed in place relative to any other text. That's the same as "Starts with" any sequence of characters.
2. [A-Za-z0-9] allows for upper and lower case, but the NC flag also takes care of that, so [a-z0-9] will do.
3. The trailing .* (any sequence of some, or no, chars) is also unnecessary, since it inherently means that any trailing chars are completely optional and irrelevant to the match.
So what we're left with is a simpler and easier-to-comprehend:
RewriteCond %{QUERY_STRING} \.[a-z0-9] [NC,OR]
You can see that many of the above lines can be similarly simplified, which makes them less intimidating-looking.
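If you'd rather not take the simplification on faith, you can compare the long and short patterns against a handful of query strings. This is only a sketch: it uses Python's re.IGNORECASE as a stand-in for the [NC] flag, and the sample query strings are invented, not taken from any real log.

```python
import re

# The pattern as written in the original line, and the trimmed version.
ORIGINAL = re.compile(r"^.*\.[A-Za-z0-9].*", re.IGNORECASE)
SIMPLIFIED = re.compile(r"\.[a-z0-9]", re.IGNORECASE)

# Hypothetical query strings for comparison.
samples = [
    "file=index.php",
    "page=2",
    "q=a.B",
    "x=.",
    "path=..%2F",
    "v=1.0&lang=EN",
    "",
]

# For each sample, record (original matched?, simplified matched?).
verdicts = {
    s: (ORIGINAL.search(s) is not None, SIMPLIFIED.search(s) is not None)
    for s in samples
}

for sample, (long_form, short_form) in verdicts.items():
    print(repr(sample), long_form, short_form)
```

On every sample the two patterns agree, which is what you'd expect, since the leading ^.* and trailing .* place no restriction on what surrounds the period-plus-character.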
You can make sense of a regular expression (or build one) by examining (or building) its component pieces one by one. What it requires is a willingness to not look at it as an intimidating jumble of text, but to pay attention to its details and work out its meaning a step at a time. With practice, it gets easier.
Like the asterisk, there are some other characters that have special meanings as wildcards or other things. When you want them not to have their special meanings, and be treated as ordinary characters, you must precede them with a backslash. I won't make a list here, but that will explain why sometimes in code you see backslashes.
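To see what difference the backslash makes, compare a pattern with and without it. This is a small Python sketch with made-up file names. re.escape is a Python convenience that adds the backslashes for you; it is not something .htaccess offers, so treat it only as a way to see which characters need escaping.

```python
import re

def found(pattern, target):
    return re.search(pattern, target) is not None

names = ["index.php", "index-php", "indexphp"]

# Unescaped: the period is a wildcard, so "index-php" also matches.
unescaped = [n for n in names if found(r"index.php", n)]

# Escaped: the period must be a literal period.
escaped = [n for n in names if found(r"index\.php", n)]

# re.escape puts a backslash in front of each special character.
auto = re.escape("a.b*c")

print(unescaped, escaped, auto)
```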
In this line
RewriteCond %{REQUEST_URI} ^/(,|;|:|<|>|'>|'<|/|\\\.\.\\).{0,9999}.* [NC,OR]
the {0,9999} is another type of "quantifier", similar to the asterisk. It determines how many occurrences of the preceding term to match. In this case, it means 0 to 9999 occurrences. The term that it quantifies is simply the period that precedes it, so it means a string of any characters, 0 to 9999 characters long.
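The {min,max} quantifier is easier to see with small numbers. The sketch below first uses a hypothetical pattern with {0,3}, then runs the pattern from the line above against a few invented request paths. It is written in Python on the assumption that its regex engine treats this pattern the same way Apache does, with re.IGNORECASE standing in for [NC].

```python
import re

# Between "a" and "b", allow 0 to 3 characters of any kind.
bounded = re.compile(r"^a.{0,3}b$")
hits = [t for t in ["ab", "a1b", "a123b", "a1234b"] if bounded.search(t)]

# The pattern from the RewriteCond line above.
REQUEST_URI_RULE = re.compile(
    r"^/(,|;|:|<|>|'>|'<|/|\\\.\.\\).{0,9999}.*", re.IGNORECASE
)

# Hypothetical request paths to test against the rule.
paths = ["/;drop", "//double", "/index.php", "/<script"]
flagged = [p for p in paths if REQUEST_URI_RULE.search(p)]

print(hits)
print(flagged)
```

"a1234b" is rejected because four characters sit between the a and the b, one more than {0,3} allows.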