25 Years of Programming Community Forum
Author Topic: If not user-agent then env=blocked  (Read 983 times)
wilson
« on: August 06, 2012, 03:23:50 AM »

Background: I recently branched a new site off my main domain, using an add-on domain (domain-dot-com) to hold files and pages that didn't fit the main site's preferred keyword profile.

On this new domain I wanted to try something like a whitelist for user agents (search engines). But we all know whitelists can grow as large as blacklists, so I thought there has to be some "if not, then" route.

In all my robots.txt files I have allowed major search engines and disallowed all others. Of course this doesn't keep any naughty bots out, but it's there for any who respect it.

Question: Am I even close?

SetEnvIfNoCase User-agent !"Googlebot-Image" blocked
SetEnvIfNoCase User-agent !"Mediapartners-Google" blocked
SetEnvIfNoCase User-agent !"Adsbot-Google" blocked
SetEnvIfNoCase User-agent !"Googlebot" blocked
SetEnvIfNoCase User-agent !"Slurp" blocked
SetEnvIfNoCase User-agent !"Teoma" blocked
SetEnvIfNoCase User-agent !"msnbot" blocked

<Files *>
Order allow,deny
deny from env=blocked
allow from all
</Files>
SteveW
« Reply #1 on: August 07, 2012, 02:25:17 AM »

That .htaccess code will result in all User-agents (and all visitors) being blocked. The SetEnvIfNoCase statements are considered separately, not as a set, so there is an implicit OR between them. If any one of them is true (and most will always be true), then "blocked" gets set.
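To make the OR problem concrete, here is a minimal sketch of the inverted approach, assuming Apache 2.2-style access directives (the same Order/Allow/Deny family used in the question). As far as I know, mod_setenvif documents `!` only as a prefix on the variable name, where it unsets that variable, and has no negated-pattern form. So the sketch flags the wanted crawlers with positive matches and then denies everything that was not flagged. The variable name `allowed_bot` is made up for illustration.

```apache
# Each line is still evaluated on its own (an implicit OR), which is
# exactly what a positive whitelist needs: any one match sets the flag.
# "Googlebot" is a substring match, so it also covers Googlebot-Image.
SetEnvIfNoCase User-Agent "Googlebot" allowed_bot
SetEnvIfNoCase User-Agent "Mediapartners-Google" allowed_bot
SetEnvIfNoCase User-Agent "Adsbot-Google" allowed_bot
SetEnvIfNoCase User-Agent "Slurp" allowed_bot
SetEnvIfNoCase User-Agent "Teoma" allowed_bot
SetEnvIfNoCase User-Agent "msnbot" allowed_bot

<Files *>
# Deny is evaluated first; a matching Allow then overrides it.
Order deny,allow
Deny from all
Allow from env=allowed_bot
</Files>
```

Like any pure whitelist, this admits only the listed crawlers. Ordinary browsers receive a 403 unless their User-Agent tokens are added as well.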

For complicated rules, it's easier to use RewriteCond and RewriteRule, like the following. When using this method, leave out the <Files> code entirely.

A block of RewriteConds is a set, and they have an implied AND between them (which is what you want) unless OR is explicitly stated (with [NC,OR]):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} !Googlebot-Image [NC]
RewriteCond %{HTTP_USER_AGENT} !Mediapartners-Google [NC]
RewriteCond %{HTTP_USER_AGENT} !Adsbot-Google [NC]
RewriteCond %{HTTP_USER_AGENT} !Googlebot [NC]
RewriteCond %{HTTP_USER_AGENT} !Slurp [NC]
RewriteCond %{HTTP_USER_AGENT} !Teoma [NC]
RewriteCond %{HTTP_USER_AGENT} !msnbot [NC]
RewriteRule .* - [F]

Unfortunately, the above code will ONLY allow robots, and lock out all other visitors! So you would have to add lines for all the browsers that human visitors use.
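One way to keep such a list manageable, sketched here on the assumption that mod_rewrite is available, is to fold the crawler names into one negated alternation and the browser tokens into another. Because `!(A|B)` means "not A AND not B", the logic is the same as a chain of separate negated conditions. The browser tokens (`Firefox`, `Chrome`, `Safari`, `MSIE`, `Opera`) are illustrative picks, not a complete list.

```apache
RewriteEngine On
# Not one of the crawlers named in the question...
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Mediapartners-Google|Adsbot-Google|Slurp|Teoma|msnbot) [NC]
# ...AND not one of these browser tokens (illustrative, incomplete).
RewriteCond %{HTTP_USER_AGENT} !(Firefox|Chrome|Safari|MSIE|Opera) [NC]
# A request that matched neither list gets 403 Forbidden.
RewriteRule .* - [F]
```

User-Agent strings are easy to spoof, so a bot that sends a browser-like string passes straight through.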

Whitelisting by UA might not really be very practical.
wilson
« Reply #2 on: August 07, 2012, 08:42:41 AM »

That's what I figured, as far as deny all.

As much as I've searched, I've never found a way to set a crawl limit in .htaccess. This is really what I'm after:

If not a preferred search engine,
behave like a normal user

or this, for example: if you're not Google and you're crawling like a search engine, then redirect to a captcha or return a 403.

I'll keep looking. Thanks, Doug Wilson
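A rough sketch of the second idea, assuming mod_rewrite is available. As far as I know, the stock directives usable in .htaccess cannot count requests per client, so this does not measure crawl rate. It only refuses clients whose User-Agent announces itself as a crawler (the `bot|crawl|spider` pattern is an assumed heuristic) and is not one of the preferred engines listed earlier in the thread. Everyone else, including ordinary browsers, is served normally.

```apache
RewriteEngine On
# The client identifies itself as some kind of crawler (assumed heuristic)...
RewriteCond %{HTTP_USER_AGENT} (bot|crawl|spider) [NC]
# ...AND it is not one of the preferred search engines.
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Mediapartners-Google|Adsbot-Google|Slurp|Teoma|msnbot) [NC]
# Both conditions held: answer 403 Forbidden.
RewriteRule .* - [F]
```

A crawler that sends a browser-like User-Agent is not caught. Blocking on actual crawl behavior would need a server module or application code that tracks requests.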