
» Surpass Web Hosting Forums » Discussions » Shared Hosting » Robots.txt

Old July 3rd, 2004, 6:53 PM   #28 (permalink)
minor deity
Super #1
 
Bigjohn's Avatar
 
Joined in Apr 2004
Lives in Georgia
Hosted on XEON
7,386 posts
Gave thanks: 27
Thanked 94 times
With the following qualifications:
Quote:
(domain name substituted with mydomain.com)
The deny from IP block is used for a bad bot trap. At the bottom of the .htaccess there is some code which prevents hotlinking to images, allowing it however from Google and other search engines:
and
Quote:
The anti-Hotlink-Part can be written using SetEnvIf as well.
I recently posted an example on another section of this board:
http://www.webmasterworld.com/forum10/1862.htm
and
Quote:
Change the Options line in the code above to

Options -Indexes +FollowSymLinks

and see if that fixes your problem. Your host may not have SymLinks enabled by default, and mod_rewrite requires it.

Remember that this forum drops the space required between "}" and "!" in the RewriteConds and RewriteRules. If you cut-n-pasted directly, you'll need to add them back in. Example:

RewriteCond %{HTTP_REFERER} SPACE_REQUIRED_HERE !^http://(www\.)?mydomain.com.*$ [NC]

And in the following line, you must also replace the broken vertical pipe "¦" character with a solid vertical pipe character.

RewriteRule \.(jpg¦JPG)$ http://www.mydomain.com/images/replace.gif [R,L]
You need to find a way to read the full original thread, since my cut and paste may have stripped some of the formatting.
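
For reference, here is a rough sketch of how the two quoted lines would probably look once the space and the solid pipe are put back. It assumes mod_rewrite is available; mydomain.com and replace.gif are just the placeholders from the quote, and the RewriteEngine line and the escaped dot in the domain are my additions, not part of the original post.

```apache
RewriteEngine On
# Note the space between "}" and "!" and the solid "|" in the alternation.
RewriteCond %{HTTP_REFERER} !^http://(www\.)?mydomain\.com.*$ [NC]
RewriteRule \.(jpg|JPG)$ http://www.mydomain.com/images/replace.gif [R,L]
```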

John
__________________
Proud to be a Surmunity Mod!
XEON PASS60 PASS61
Make a fundamental difference!
My Sites:
Curious about Brewing Beer? Join the community!
>>>>> Some Change is GOOD! Keep your paycheck! Support the Fair Tax
Get into an Art museum
Victorian London
It's your brain -ON WEB - mybrainhost.com (under development)
What SHOULD Government do? Much Less than it Does!
Bigjohn is offline  
Old October 22nd, 2004, 5:34 PM   #29 (permalink)
Registered User
Seasoned Poster
 
Joined in Jun 2004
Lives in Louisville, Ky.
Hosted on Mac, Muy, Pass21
86 posts
Gave thanks: 0
Thanked 0 times
I don't mean to be rehashing old threads, but I need to ask for a little help on this matter. I am having a problem with people using scripts to download and mirror my site. I keep banning them, but they keep coming back with different IPs. My question is this: my mod_rewrite setup looks very different from the examples you have provided, so I am not sure mine is even effective.
My .htaccess looks like this:


Code:
Options Indexes FollowSymLinks Includes
AddType application/x-httpd-cgi .cgi
AddType text/x-server-parsed-html .html

Redirect temp /forum/printthread.php http://mydomain.com/index.php


order allow,deny
allow from all




RewriteEngine on
Options Indexes FollowSymLinks Includes
AddType application/x-httpd-cgi .cgi
AddType text/x-server-parsed-html .html
RewriteBase /
RewriteCond %{HTTP_REFERER} !^http://mydomain.com/.*$      [NC]
RewriteCond %{HTTP_REFERER} !^http://mydomain.com$      [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.com/.*$      [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.com$      [NC]
RewriteCond %{HTTP_REFERER} !^http://mydomain.com/.*$      [NC]
RewriteCond %{HTTP_REFERER} !^http://mydomain.com$      [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.com/.*$      [NC]
RewriteCond %{HTTP_REFERER} !^http://www.mydomain.com$      [NC]
RewriteRule .*\.(jpg|jpeg|gif|png|bmp)$ - [F,NC]



RewriteEngine on
Options Indexes FollowSymLinks Includes
AddType application/x-httpd-cgi .cgi
AddType text/x-server-parsed-html .html
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^Alexibot [OR]
RewriteCond %{HTTP_USER_AGENT} ^asterias [OR]
RewriteCond %{HTTP_USER_AGENT} ^BackDoorBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Black.Hole [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlowFish [OR]
RewriteCond %{HTTP_USER_AGENT} ^BotALot [OR]
RewriteCond %{HTTP_USER_AGENT} ^BuiltBotTough [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bullseye [OR]
RewriteCond %{HTTP_USER_AGENT} ^BunnySlippers [OR]
RewriteCond %{HTTP_USER_AGENT} ^Cegbfeieh [OR]
RewriteCond %{HTTP_USER_AGENT} ^CheeseBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^CopyRightCheck [OR]
RewriteCond %{HTTP_USER_AGENT} ^cosmos [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^EroCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Foobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^Googlebot-Image [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^Harvest [OR]
RewriteCond %{HTTP_USER_AGENT} ^hloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} ^httplib [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^humanlinks [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InfoNaviRobot [OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JennyBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Kenjin.Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Keyword.Density [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^LexiBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^libWeb/clsHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl/5.800 [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkextractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkScan/8.1a.Unix [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-trivial [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mata.Hari [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIIxpc [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister.PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^moget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/2 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/3.Mozilla/2.01 [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^NPBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline.Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^Openfind [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProPowerBot/2.14 [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^ProWebWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^QueryN.Metasearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^RepoMonkey [OR]
RewriteCond %{HTTP_USER_AGENT} ^RMA [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SlySearch [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SpankBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^spanner [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^suzuran [OR]
RewriteCond %{HTTP_USER_AGENT} ^Szukacz/1.4 [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Telesoft [OR]
RewriteCond %{HTTP_USER_AGENT} ^The.Intraformant [OR]
RewriteCond %{HTTP_USER_AGENT} ^TheNomad [OR]
RewriteCond %{HTTP_USER_AGENT} ^TightTwatBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^Titan [OR]
RewriteCond %{HTTP_USER_AGENT} ^toCrawl/UrlDispatcher [OR]
RewriteCond %{HTTP_USER_AGENT} ^toCrawl/UrlDispatcher [OR]
RewriteCond %{HTTP_USER_AGENT} ^True_Robot [OR]
RewriteCond %{HTTP_USER_AGENT} ^turingos [OR]
RewriteCond %{HTTP_USER_AGENT} ^TurnitinBot/1.5 [OR]
RewriteCond %{HTTP_USER_AGENT} ^URLy.Warning [OR]
RewriteCond %{HTTP_USER_AGENT} ^VCI [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebBandit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEnhancer [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.Image.Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebmasterWorldForumBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website.Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^Webster.Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZip [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW-Collector-E [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xenu's [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.*$ nastyone.php [L]




deny from 202.159.222.214
deny from 216.25.85.51
deny from 65.108.23.86

Your examples use "SetEnvIf", where mine uses "RewriteCond".
Do I need to change this?

Can anyone tell me if this will work?
Is there a way to check if it is working?
Thanks in advance,
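
For comparison, this is roughly what a few of the RewriteCond lines above would look like in the SetEnvIf style used earlier in the thread. It is a sketch only: the three user agents are picked from the list above as examples, bad_bot is an arbitrary variable name, and it assumes the host allows these directives in .htaccess. Both styles match on the User-Agent header; the difference is that this one refuses the request with a 403 instead of rewriting it to nastyone.php.

```apache
# Tag requests whose User-Agent starts with a known downloader name.
SetEnvIfNoCase User-Agent "^HTTrack" bad_bot
SetEnvIfNoCase User-Agent "^WebReaper" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot

# Refuse tagged requests (older Order/Allow/Deny syntax, as used in this thread).
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

One way to check whether a block like this is working is to send a request with one of the listed user agents, for example `curl -I -A "Wget" http://mydomain.com/`, and see whether the server answers 403 Forbidden.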
Lady Madness is offline  
Old October 24th, 2004, 12:19 PM   #30 (permalink)
*Anyone?
Lady Madness is offline  
Old September 29th, 2005, 7:41 PM   #31 (permalink)
Surpass Fan
Comfy Contributor
 
pseudoswede's Avatar
 
Joined in Jun 2003
Lives in Denver
Hosted on D9
142 posts
Gave thanks: 4
Thanked 3 times
I'm going to try to implement the stuff below on my domains, but I have a few questions first. Sorry for the silly questions...

Quote:
Originally Posted by johnaikin
Here is what I try to ward off spammers:

A combination of BigJohn's great tutorial on spamassassin at Make SPAM ASSASSIN work for you...

and robots.txt, Meta tags and the following .htaccess:
Code:
Options -Indexes

SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
...(rest removed to save space)...
SetEnvIfNoCase User-Agent "^ZBot" bad_bot


<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>

SetEnvIfNoCase Request_URI ban-ip\.txt ban

<Files ~ "^.*$">
order allow,deny
allow from all
deny from env=ban
</Files>
Currently, my .htaccess file is empty. I just cut and paste everything in the code section into it, right?

Quote:
Originally Posted by johnaikin
Now there is some other stuff going on in there. For example, notice the line about 6 lines from the bottom? That is a little trick to specifically ban the IP addresses of spammers and spambots, like this:

On my index page, very early, I have an invisible link that looks like this
Code:
<div style="display:none;"><a href="http://www.fallcreektech.com/sandtrap">sandtrap</a></div>
/sandtrap has an index file that looks like this:
Code:
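<?php
// Log the visitor's IP address; $_SERVER works even when register_globals is off.
$ip = $_SERVER['REMOTE_ADDR'] . "\n";
$banip = '/home/fallcre/public_html/ban-ip/ban-ip.txt';
// Open in append mode so earlier entries are kept.
$fp = fopen($banip, "a");
if ($fp) {
    fputs($fp, $ip);
    fclose($fp);
}
?>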
What kind of index file do I create? index.html or index.php?

Quote:
Originally Posted by johnaikin
My robots.txt has an entry specifically relating to this (there are other entries naturally):
User-agent: *
Disallow: /sandtrap
Disallow: /ban-ip

So, put all together, what I've done is use robots.txt to tell robots not to index /sandtrap. Any robot that isn't well behaved will go ahead and follow the first link it finds: the hidden link to /sandtrap.

Using php, I get the ip address of that robot and add it to my ban-ip file.

Then using .htaccess, any ip address in that ban-ip file won't be allowed access to my site.

The line I referred to earlier
Code:
 SetEnvIfNoCase Request_URI ban-ip\.txt ban
and the entries under it are there to prevent people from seeing my banned IP list generated from this technique.
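
The excerpt quoted above hides ban-ip.txt from visitors, but the part that turns its entries into actual bans was presumably in the section removed to save space. As a guess at one simple way to do it, assuming the logged addresses are copied into .htaccess by hand or by a script, plain Deny lines would work. The addresses below are made-up documentation examples, not real entries.

```apache
# Sketch only: 192.0.2.x addresses stand in for entries copied out of ban-ip.txt.
Order Allow,Deny
Allow from all
Deny from 192.0.2.15
Deny from 192.0.2.44
```

With Order Allow,Deny, a matching Deny line wins over "Allow from all", so the listed addresses get a 403 while everyone else is let through.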

This .htaccess file also includes blocks for file downloaders, download managers, and other nastiness I've decided to keep away from my site.

Naturally, you have to keep an eye on your logs, because the various bots come and go quickly.

By using this combination of techniques, I've cut my spam from over 900 per day to under 20.

I hope this is helpful.

John

P.S.
I wish I could figure out why ISPs don't ban a list like this using httpd.conf. It would make everyone's life easier, and their email servers would sure be happy about it.
The rest of this stuff looks self-explanatory.

Has anyone else tried what johnaikin wrote here? If so, how well did it work?

Thanks in advance!
__________________
"In the end, everything will be fine - if it is not fine, it is not the end."
PseudoSwede
larvez.com
Dime9
pseudoswede is offline  
Old September 29th, 2005, 8:27 PM   #32 (permalink)
ceo
Senior Member
Super #1
 
Joined in Jan 2005
1,546 posts
Gave thanks: 70
Thanked 33 times
I haven't done what he has, but yes, you do just copy everything he gave (I thought the list of bad bots was helpful) and put it in .htaccess, then create the link and the files he specifies. I don't think it matters what format your index is, but the file that grabs the IPs and such is coded in PHP, I believe.
ceo is offline  
Old October 5th, 2005, 5:58 PM   #33 (permalink)
Oops! Another question...

Quote:
Originally Posted by johnaikin
Code:
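<?php
// Log the visitor's IP address; $_SERVER works even when register_globals is off.
$ip = $_SERVER['REMOTE_ADDR'] . "\n";
$banip = '/home/fallcre/public_html/ban-ip/ban-ip.txt';
// Open in append mode so earlier entries are kept.
$fp = fopen($banip, "a");
if ($fp) {
    fputs($fp, $ip);
    fclose($fp);
}
?>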
Using php, I get the ip address of that robot and add it to my ban-ip file.
Do I also need to create a ban-ip folder and ban-ip.txt file?
pseudoswede is offline  
Old August 19th, 2007, 9:22 PM   #34 (permalink)
Registered User
Seasoned Poster
 
Joined in Aug 2003
46 posts
Gave thanks: 0
Thanked 0 times
I have a site with some material I do not want seen by robots, especially web reapers. How do I hide these areas? I have used a robots.txt file, but it does not work that well: some files are seen and others are not. I tested my site with WebReaper and it was able to get into places where it should not have.
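
Worth noting here: robots.txt is only a request that well-behaved crawlers choose to honor, so a tool like WebReaper can simply ignore it. One option, sketched below under assumptions, is to password-protect the private directory with an .htaccess file placed inside it. The realm name and the AuthUserFile path are hypothetical, and the password file itself would be created with Apache's htpasswd utility.

```apache
# Goes in the .htaccess of the directory to be hidden.
AuthType Basic
AuthName "Private area"
# Hypothetical path; keep the password file outside public_html.
AuthUserFile /home/username/.htpasswd
Require valid-user
```

Unlike robots.txt, this is enforced by the server, so it applies to every client whether or not it identifies itself as a robot.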
franwald is offline  