Robot Exclusion?

boone
Regular
Posts: 16
Joined: Sat Jan 17, 2004 3:09 am

Robot Exclusion?

Post by boone »

Is there any easy way to tell web spiders like Googlebot to not index stuff like the captcha graphics?
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: Robot Exclusion?

Post by garvinhicking »

Using a robots.txt like this:

User-agent: *
Disallow: /serendipity/plugin/
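should help. I'm not sure wildcards are allowed in Disallow (the original robots.txt convention matches by path prefix, and only some crawlers such as Googlebot understand `*`). If they are, you could do: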

User-agent: *
Disallow: /serendipity/plugin/spamblock_captcha*
to filter only the captcha images. You can of course also block any paths, referrers, or bots such as Googlebot via mod_rewrite rules.

Sadly, rel=nofollow does not work for <img> tags.
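
For anyone who wants to check such rules before deploying them, here is a minimal sketch using Python's standard urllib.robotparser. The host example.com and the captcha file name are made up for illustration. The rule is written in the prefix form without the trailing wildcard, since this parser matches Disallow values as path prefixes; how a particular crawler treats `*` is not something this sketch can tell you.

```python
from urllib.robotparser import RobotFileParser

# Rules as they would appear in robots.txt. No wildcard is used here,
# because this parser matches Disallow values as plain path prefixes.
RULES = """\
User-agent: *
Disallow: /serendipity/plugin/spamblock_captcha
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    base = "http://example.com"  # hypothetical host
    for path in (
        "/serendipity/plugin/spamblock_captcha_1234",  # hypothetical captcha URL
        "/serendipity/plugin/other",
        "/serendipity/index.php",
    ):
        print(path, is_allowed(RULES, "Googlebot", base + path))
```

Only the first path should come back as blocked; the other plugin URL and the blog itself stay crawlable.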

Regards,
garvin
# Garvin Hicking (s9y Developer)
# Did I help you? Consider making me happy: http://wishes.garv.in/
# or use my PayPal account "paypal {at} supergarv (dot) de"
# My "other" hobby: http://flickr.garv.in/
kjoker
Regular
Posts: 26
Joined: Fri May 13, 2005 4:32 pm

robots

Post by kjoker »

I have robots.txt in the main serendipity folder with this config:

User-agent: *
Disallow: /

But robots still spidered my pages. The most annoying one is askjeeves.com, which spidered the category pages as well.

???
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Re: robots

Post by garvinhicking »

Then you should contact askjeeves and tell them to honour your robots.txt file.

Of course you could also block them at the .htaccess or VirtualHost level, but that is hard to maintain.

Regards,
Garvin
kjoker
Regular
Posts: 26
Joined: Fri May 13, 2005 4:32 pm

thanks :)

Post by kjoker »

Got it to work. Actually, I added the noindex, nofollow meta tags, and I also added another robots.txt in the root folder :)
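
The root folder matters because crawlers request robots.txt from the top level of the host, not from the directory the blog is installed in. A small sketch of that lookup, assuming a hypothetical blog under http://example.com/serendipity/ (the function name robots_url is made up for illustration):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL a crawler would consult for `page_url`."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path with /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    # Hypothetical page inside the serendipity folder
    print(robots_url("http://example.com/serendipity/index.php?/categories/1"))
```

A robots.txt that only exists at /serendipity/robots.txt is never requested under this scheme, which would explain why the first attempt had no effect.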
MySchizoBuddy
Regular
Posts: 340
Joined: Sun Jun 12, 2005 5:28 am

Post by MySchizoBuddy »

Garvin, what else do you think should be disallowed for security reasons?
The Google sitemap file is one candidate.
What else?
garvinhicking
Core Developer
Posts: 30022
Joined: Tue Sep 16, 2003 9:45 pm
Location: Cologne, Germany

Post by garvinhicking »

I don't think anything should be disallowed for security reasons. robots.txt only concerns files that are accessible via HTTP, and we have already secured those files as well as possible.

Blocking files only matters if people don't want the extra traffic from robots. I have not investigated which files the robots do not need to index.

Regards,
Garvin