Quote:
Originally Posted by Dean C
Also your addon will mean Google will not try to index the page. Maybe I'm missing something here, but why on earth would you not want the search engines to index your page? The only use for this will be on blog comment pages. Just because a spambot sees your link with rel="nofollow" in it doesn't mean it won't spam the email.
Quote:
Originally Posted by kall
Regarding the addon: I don't know why he is suggesting to have noindex in the header of each page...not something I would do myself.
Not to hijack the thread, but there are many good reasons not to be indexed; it is up to each person to decide whether they want to be or not. That is why I offered the alternative of including only the "nofollow" meta tag instead of both noindex and nofollow. It seems to me that doing one without the other (putting rel="nofollow" on the URL, but not nofollow in the meta tag) is only half of the solution.
You can also tell the spider to ignore only specific parts of your site in a few different ways. One way is to use a "robots.txt" file. The robots.txt is a TEXT file (not HTML!) which has a section for each robot to be controlled. Each section has a user-agent line which names the robot to be controlled and has a list of "disallows" and "allows". Each disallow will prevent any address that starts with the disallowed string from being accessed. Similarly, each allow will permit any address that starts with the allowed string from being accessed. The (dis)allows are scanned in order, with the last match encountered determining whether an address is allowed to be used or not. If there are no matches at all then the address will be used.
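To make the matching rules above concrete, here's a minimal sketch (not FreeFind's actual code) of the prefix-match, last-match-wins logic just described. The rule list and paths are made up for illustration:

```python
# Illustrates the robots.txt matching described above: each rule is a
# prefix match, rules are scanned in order, and the LAST match decides.
RULES = [
    ("disallow", "/mysite/test/"),
    ("allow", "/mysite/test/public/"),
]

def is_allowed(path, rules):
    """Return True if this address may be fetched under these rules."""
    verdict = True  # no match at all means the address is used
    for kind, prefix in rules:
        if path.startswith(prefix):
            verdict = (kind == "allow")  # last matching rule wins
    return verdict

print(is_allowed("/mysite/test/secret.html", RULES))      # False
print(is_allowed("/mysite/test/public/faq.html", RULES))  # True (allow matched last)
print(is_allowed("/mysite/index.html", RULES))            # True (no match)
```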
Using a robots.txt file is easy. If your site is located at:
http://domain.com/mysite/index.html
you will need to be able to create a file located here:
http://domain.com/robots.txt
Here's an example:
Code:
user-agent: FreeFind
disallow: /mysite/test/
disallow: /mysite/cgi-bin/post.cgi?action=reply
disallow: /a
In this example the following addresses would be ignored by the spider:
Code:
http://domain.com/mysite/test/index.html
http://domain.com/mysite/cgi-bin/post.cgi?action=reply&id=1
http://domain.com/mysite/cgi-bin/post.cgi?action=replytome
http://domain.com/abc.html
and the following ones would be allowed:
Code:
http://domain.com/mysite/test.html
http://domain.com/mysite/cgi-bin/post.cgi?action=edit
http://domain.com/mysite/cgi-bin/post.cgi
http://domain.com/bbc.html
It is also possible to use an "allow" in addition to disallows. For example:
Code:
user-agent: FreeFind
disallow: /cgi-bin/
allow: /cgi-bin/Ultimate.cgi
allow: /cgi-bin/forumdisplay.cgi
This robots.txt file prevents the spider from accessing any cgi-bin address except Ultimate.cgi and forumdisplay.cgi.
Using allows can often simplify your robots.txt file.
Here's another example showing a robots.txt with two sections: one for all robots, and one for the FreeFind spider:
Code:
user-agent: *
disallow: /cgi-bin/
user-agent: FreeFind
disallow:
In this example all robots except the FreeFind spider will be prevented from accessing files in the cgi-bin directory. FreeFind will be able to access all files (a disallow with nothing after it means "allow everything").
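Here's a quick sketch of how a robot picks which section applies to it: use the section whose user-agent matches the robot's name, and fall back to the "*" section otherwise. The section data below is hypothetical, mirroring the two-section example above:

```python
# Hypothetical sections keyed by user-agent name (lowercased);
# the value is that section's list of disallowed prefixes.
SECTIONS = {
    "*": ["/cgi-bin/"],  # applies to robots with no section of their own
    "freefind": [],      # empty disallow list = allow everything
}

def disallows_for(agent, sections):
    """Pick the section matching the robot's name, else the '*' section."""
    return sections.get(agent.lower(), sections.get("*", []))

def allowed(agent, path, sections):
    return not any(path.startswith(p) for p in disallows_for(agent, sections))

print(allowed("FreeFind", "/cgi-bin/post.cgi", SECTIONS))   # True
print(allowed("Googlebot", "/cgi-bin/post.cgi", SECTIONS))  # False
```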
Examples:
To prevent FreeFind from indexing your site at all:
Code:
user-agent: FreeFind
disallow: /
To prevent FreeFind from indexing common FrontPage image-map junk:
Code:
user-agent: FreeFind
disallow: /_vti_bin/shtml.exe/
To prevent FreeFind from indexing a test directory and a private file:
Code:
user-agent: FreeFind
disallow: /test/
disallow: private.html
To let FreeFind index everything but prevent other robots from accessing certain files:
Code:
user-agent: *
disallow: /cgi-bin/
disallow: this.html
disallow: and.html
disallow: that.html
user-agent: FreeFind
disallow:
Here are some more examples:
The exclusion:
http://mysite.com/ignore.html
prevents that file from being included in the index.
The exclusion:
http://mysite.com/archive/*
prevents everything in the "archive" directory from being included in the index.
The exclusion:
/archive/*
prevents everything in any "archive" directory from being included in the index regardless of the site it's on.
The exclusion:
http://mysite.com/*.txt
prevents files on "mysite.com" that end with the extension ".txt" from being included in the index.
The exclusion:
*.txt
prevents all files that end with the extension ".txt" from being included in the index regardless of what site they're on.
The exclusion:
http://mysite.com/alphaindex/?.html
prevents a file like "http://mysite.com/alphaindex/a.html" from being indexed, but would allow a file "http://mysite.com/alphaindex/aardvark.html" to be indexed.
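The "*" and "?" in these exclusions behave like shell glob wildcards ("*" matches any string, "?" matches a single character), so Python's standard fnmatch module can illustrate the matching. The URLs below are the examples from above:

```python
from fnmatch import fnmatch

# Exclusion patterns taken from the examples above.
exclusions = [
    "http://mysite.com/archive/*",
    "*.txt",
    "http://mysite.com/alphaindex/?.html",
]

def is_excluded(url, patterns):
    """True if any glob-style exclusion pattern matches the URL."""
    return any(fnmatch(url, p) for p in patterns)

print(is_excluded("http://mysite.com/archive/old.html", exclusions))     # True
print(is_excluded("http://othersite.com/notes.txt", exclusions))         # True
print(is_excluded("http://mysite.com/alphaindex/a.html", exclusions))    # True
print(is_excluded("http://mysite.com/alphaindex/aardvark.html", exclusions))  # False
```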
The exclusion:
http://mysite.com/alphaindex/?.html index=no follow=yes
prevents a file like "http://mysite.com/alphaindex/a.html" from being added to the index but would allow the spider to find and follow the links in that page.
The exclusion:
http://mysite.com/endwiththis.html index=yes follow=no
allows that file to be added to the index but prevents the spider from following any of the links in that file.
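A small sketch of how such an exclusion line could be split into its pattern and its index=/follow= flags. The default values here (a plain exclusion blocking both indexing and following) are my assumption, not confirmed FreeFind behavior:

```python
def parse_exclusion(line):
    """Split an exclusion line into (pattern, options).

    Assumption: a bare exclusion with no flags blocks both
    indexing and link-following.
    """
    parts = line.split()
    pattern = parts[0]
    opts = {"index": "no", "follow": "no"}  # assumed defaults
    for part in parts[1:]:
        key, _, value = part.partition("=")
        opts[key] = value
    return pattern, opts

pattern, opts = parse_exclusion(
    "http://mysite.com/endwiththis.html index=yes follow=no")
print(pattern)  # http://mysite.com/endwiththis.html
print(opts)     # {'index': 'yes', 'follow': 'no'}
```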