Google sitemap for the vB Archives. Redirect human and robots.
Version: 1.2, by lierduh
Developer Last Online: Nov 2023

vB Version: 3.5.1
Released: 08-09-2005 Last Update: 11-08-2005 Installs: 130
Uses Plugins, Code Changes, Additional Files
No support by the author.

Release V1.2 (9 Nov 2005)
* Threads with new posts are given a higher sitemap priority, so Google can index fresh threads first.

* The original optional STEP 3 hack is no longer recommended. To avoid a potential Google penalty, my advice is to remove the STEP 3 hack.

Release V1.1a (12 Oct 2005)

* Bug fix only

Release V1.1 (9 Oct 2005)

* Can handle very large forums with more than 50,000 URLs per forum.
URLs are spanned across multiple files for each large forum.

* Created a function to detect search engine crawlers. The vB built-in
search engine detector can only identify about 3 or 4 search engines.
My function will detect over 20 search engine crawlers.

* Supports forums hosted by web servers that do not support 'fix_pathinfo',
i.e. forums whose archive links look like 'archive/index.php?f-10.html'
instead of the usual 'archive/index.php/f-10.html'.

* Alerts about wrong directory permissions to help newbies.

* Automatically writes the index file to the archive directory if the PHP
script cannot write into the base vB directory.

* Bug fixes.


Objectives
==============
  • Create Google sitemap files and a sitemap index file for the vB archives, and submit them to Google via the Scheduled Tasks.
  • Use the vB Archive as a mirror of the actual threads.
  • Google loves the nature of the archive pages, as they are static and do not contain repeated content.
  • Google weighs pages heavily by external links, so robots following external thread links need to be redirected to the archive pages.
  • We often see vBulletin archive pages in the Google search results, and users who click them land on the archive page instead of the actual thread. We need to redirect human visitors automatically to the actual thread; otherwise the visitor either has to click again for the Full Version or read the dull archive content.

Q and A
==============
Q. Would the sitemap contain the links for hidden forums?
A. No, forum permissions are consulted when the sitemap files are generated.

Q. How often are the sitemap files generated?
A. You decide, and set it in the Scheduled Tasks. By default the script cannot be called by external users, to prevent bored people from killing your server.

Q. Is the sitemap file compressed?
A. Yes, the individual sitemap files are gzipped, as the Google sitemap standard allows, to save bandwidth. The sitemap index file is not compressed; it is submitted as a normal XML file.
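
A rough illustration of that layout, not the mod's actual PHP code: the sketch below builds one sitemap file, gzips it, and builds an uncompressed index that points at the .gz file. The function names are made up for this example, and the XML namespace is the current sitemaps.org 0.9 schema, which is an assumption rather than what the mod emitted in 2005.

```python
import gzip
from xml.sax.saxutils import escape

# Assumption: the sitemaps.org 0.9 namespace, not necessarily what the mod writes.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return an uncompressed sitemap document (bytes) listing the given URLs."""
    entries = "".join(f"  <url><loc>{escape(u)}</loc></url>\n" for u in urls)
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<urlset xmlns="{SITEMAP_NS}">\n{entries}</urlset>\n'
    )
    return xml.encode("utf-8")


def compress_sitemap(xml_bytes):
    """Gzip a sitemap document; this is what would be saved as sitemap_N.gz."""
    return gzip.compress(xml_bytes)


def build_index(sitemap_urls):
    """Return the sitemap index (bytes); it stays a plain, uncompressed XML file."""
    entries = "".join(
        f"  <sitemap><loc>{escape(u)}</loc></sitemap>\n" for u in sitemap_urls
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<sitemapindex xmlns="{SITEMAP_NS}">\n{entries}</sitemapindex>\n'
    )
    return xml.encode("utf-8")
```

The index lists the URL of each gzipped file, so only the per-forum files need compressing.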

Q. Would the sitemaps include links for the normal threads? e.g. showthread.php?t=1234...
A. No. It is unlikely Google will index your entire site if you feed it every combination of showthread links. It is better to let Google go through the more static archives; you will have a better chance of getting more thread content indexed by Google this way.

Q. Why don't you go crazy with rewrite rules and do things like including the thread title in the URL?
A. I won't deny that having keywords in the URL is a good SEO strategy, but Google also does not like "Over Search Engine Optimized" web sites. Google has recently penalized a huge number of such sites, sending them from a page rank of 5 or 6 to 0.

Q. Does sitemap really help?
A. Definitely. Google has crawled over 60,000 pages since I submitted my sitemaps a few days ago. Yahoo bots were visiting more pages than Google before the sitemap. I expect the total Google visits for this month to exceed Yahoo's in the next day or two.

What is involved?
==================
I have divided this hack into two steps. The first step involves uploading a PHP file. This enables the sitemap to be generated and submitted to Google.

The second step involves installing a Plugin using AdminCP. This sends all robots to the archive pages, preventing them from viewing the actual threads.

For example, when Google or another crawler follows an external link to visit:
http://forums.mysite/showthread.php?t=1234&page=2

it will be told that this page has permanently moved to:
http://forums.mysite/archive/index.php/t-1234-p-2

This way you don't lose the page rank gained from external links.
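
The mapping in the example above can be sketched in a few lines. This is only an illustration in Python, not the plugin's PHP code; the function name and the use_path_info switch (for servers that need the 'index.php?' link form) are assumptions made for this sketch.

```python
from urllib.parse import parse_qs, urlsplit


def archive_url(thread_url, use_path_info=True):
    """Map a showthread.php URL to its archive URL, or return None if it is not one.

    Hypothetical helper: mirrors the example URLs in this post only.
    """
    parts = urlsplit(thread_url)
    if not parts.path.endswith("/showthread.php"):
        return None
    query = parse_qs(parts.query)
    try:
        thread_id = int(query["t"][0])
        page = int(query.get("page", ["1"])[0])
    except (KeyError, ValueError):
        return None  # no usable thread id, so nothing to redirect to

    target = f"t-{thread_id}"
    if page > 1:
        target += f"-p-{page}"

    # Directory that holds showthread.php, e.g. "/" or "/forums/".
    base = parts.path[: -len("showthread.php")]
    separator = "/" if use_path_info else "?"
    return f"{parts.scheme}://{parts.netloc}{base}archive/index.php{separator}{target}"
```

A crawler requesting the thread URL would then receive a 301 response whose Location header is the returned archive URL.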

Install
=========
To install, follow the readme file.
To let me know you have installed this, and to let me send you update information, please click INSTALL.

Strategy
=========

It is unlikely that Google or any other search engine will index your entire site, especially given the dynamic nature of vBulletin forums. An archive sitemap lets Google concentrate on the real content of your forums: the threads. If Google has to go through the endless member profile pages, it will get sick of it and become tired (sorry, perhaps robots cannot become tired). What we can do is disallow the crawling of unnecessary pages. My robots.txt contains:

#ALL BOTS
User-agent: *
Disallow: /admincp/
Disallow: /ajax.php
Disallow: /attachments/
Disallow: /clientscript/
Disallow: /cpstyles/
Disallow: /images/
Disallow: /includes/
Disallow: /install/
Disallow: /modcp/
Disallow: /subscriptions/
Disallow: /customavatars/
Disallow: /customprofilepics/
Disallow: /announcement.php
Disallow: /attachment.php
Disallow: /calendar.php
Disallow: /cron.php
Disallow: /editpost.php
Disallow: /external.php
Disallow: /faq.php
Disallow: /frm_attach
Disallow: /image.php
#Disallow: /index.php
Disallow: /inlinemod.php
Disallow: /joinrequests.php
Disallow: /login.php
Disallow: /member.php?
Disallow: /memberlist.php
Disallow: /misc.php
Disallow: /moderator.php
Disallow: /newattachment.php
Disallow: /newreply.php
Disallow: /newthread.php
Disallow: /online.php
Disallow: /payment_gateway.php
Disallow: /payments.php
Disallow: /poll.php
Disallow: /postings.php
Disallow: /printthread.php
Disallow: /private.php
Disallow: /profile.php
Disallow: /register.php
Disallow: /report.php
Disallow: /reputation.php
Disallow: /search.php
Disallow: /sendmessage.php
Disallow: /showgroups.php
Disallow: /showpost.php
Disallow: /subscription.php
Disallow: /threadrate.php
Disallow: /usercp.php
Disallow: /usernote.php

You may have noticed index.php in the list (it is commented out in the copy above). Apparently Google regards http://forums.mysite/index.html as the same as http://forums.mysite/
...but http://forums.mysite/index.php as a different file. The default vB templates use index.php for internal links, which splits the page rank of your home page! So it is better not to let Google see this file.

If you have mod_rewrite installed, perhaps you could add this to the .htaccess file:

RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index\.php$ / [R=301,L]

(If your forums are under http://site/forums/, try: RewriteRule ^forums/index\.php$ /forums/ [R=301,L])

That will redirect /index.php to /, but only if no query string is present, i.e. /index.php?do=mymod will not be redirected.
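
To make the intended behaviour concrete, here is a small Python model of that rule. It is only a sketch for reasoning about which URLs get redirected, not something to deploy; the function name and the prefix parameter are made up for this example.

```python
from urllib.parse import urlsplit


def redirect_target(url, prefix="/"):
    """Model the rewrite rule: 301 index.php to its directory when no query string is present.

    prefix is the forum directory, "/" or e.g. "/forums/" (hypothetical parameter).
    Returns (301, location) for a redirect, or None when the request is left alone.
    """
    parts = urlsplit(url)
    if parts.path == prefix + "index.php" and parts.query == "":
        return (301, f"{parts.scheme}://{parts.netloc}{prefix}")
    return None
```

With this model, http://forums.mysite/index.php is redirected to http://forums.mysite/, while http://forums.mysite/index.php?do=mymod is not.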

  • This modification may not be copied, reproduced or published elsewhere without author's permission.

Comments
#92
10-07-2005, 11:58 PM
lierduh

This is something I had in mind to implement. So next version will certainly contain this feature.

I think I should be able to push a new version out this weekend including better documentation for the step 3. I have been waiting for the vB Gold.

Quote:
Originally Posted by buro9
I have a problem though... you're making a sitemap gz for each forum, well, some of my forums are big:


Could you add spanning?

So we'd start with:
http://www.bowlie.com/forum/archive/sitemap_4_1.gz

And when we passed an arbitrary value (make it a setting in the file in case Google change it later) we would move onto:
http://www.bowlie.com/forum/archive/sitemap_4_2.gz
http://www.bowlie.com/forum/archive/sitemap_4_3.gz
through
http://www.bowlie.com/forum/archive/...p_4_9999999.gz
etc

As it stands, Google is now refusing to pay attention to mine, as the one sitemap that exceeds the limit basically causes the whole thing to error.
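
The spanning requested in this quote could look roughly like the Python sketch below. It is not the mod's code: the function name is invented, and the 50,000 default is taken from the per-file URL limit mentioned in the V1.1 release notes, kept as a parameter in case the limit changes.

```python
def span_sitemaps(forum_id, urls, max_urls=50000):
    """Split one forum's URLs into numbered sitemap files of at most max_urls each.

    Returns a dict mapping file names such as "sitemap_4_1.gz" to their URL lists.
    """
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    files = {}
    for start in range(0, len(urls), max_urls):
        part = start // max_urls + 1  # parts are numbered from 1
        files[f"sitemap_{forum_id}_{part}.gz"] = urls[start:start + max_urls]
    return files
```

A forum with 120,000 thread URLs would come out as sitemap_4_1.gz and sitemap_4_2.gz with 50,000 URLs each, plus sitemap_4_3.gz with the remaining 20,000.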

#93
10-08-2005, 01:23 AM
Unreal Player

Is it normal for my site to still be PENDING after 6 hours at Google? And how does my site know which account I'm using, so it can resubmit automatically?

#94
10-08-2005, 06:06 PM
dutchbb

Quote:
Originally Posted by lierduh
This is something I had in mind to implement. So next version will certainly contain this feature.

I think I should be able to push a new version out this weekend including better documentation for the step 3. I have been waiting for the vB Gold.
HI

Google Spider still only looks at the normal threads in who's online?

Only the Yahoo! Slurp Spider looks at the archives?

#95
10-08-2005, 11:06 PM
falter

Hi there,
I'm very happy with the archive redirection. That's pretty slick stuff, and it seems to be working great. The sitemap submission to Google hasn't really taken effect quite yet, but it's only been 36 hours since submission (I imagine that these things can take some time). Yahoo is going bonkers on us, though!

Anyway, I've submitted a bug/feature request to vbulletin as a result of installing this mod. You can see it here:
http://www.vbulletin.com/forum/bugs3...iew&bugid=1576

Specifically, it has to do with the way in which $show['search_engine'] is defined, which seems important as it plays quite an important role in this particular mod.

Looking at the definition of $show['search_engine'] seemed important as I, like others, have noticed that sometimes googlebot doesn't want to get redirected from showthread to the archives.

(as seen in /includes/init.php)
Code:
$show['search_engine'] = ($vbulletin->superglobal_size['_COOKIE'] == 0 AND preg_match("#(google|msnbot|yahoo! slurp)#si", $_SERVER['HTTP_USER_AGENT']));
As you can see, vBulletin assumes that no search engine spider will ever send a cookie. I found the redirection to be more effective after removing the check for the absence of a cookie, which resulted in this:
Code:
$show['search_engine'] = (true AND preg_match("#(google|msnbot|yahoo! slurp)#si", $_SERVER['HTTP_USER_AGENT']));
Now, as you can see in my bug report, I'm not terribly satisfied with the way $show['search_engine'] is defined in the first place, but making the mod as seen above helped me out, some.
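
For anyone who wants to see the difference between the two variants side by side, here is the same check modelled in Python. It is a sketch only: the function name and the require_no_cookies switch are made up for illustration, and the pattern is just the three names from the vBulletin line above, not the longer crawler list from the mod.

```python
import re

# Same three names as the init.php pattern above, matched case-insensitively.
CRAWLER_PATTERN = re.compile(r"google|msnbot|yahoo! slurp", re.IGNORECASE)


def is_search_engine(user_agent, cookie_count=0, require_no_cookies=True):
    """Return True if the request looks like a search engine crawler.

    require_no_cookies=True models the stock check (any cookie means "not a spider");
    require_no_cookies=False models the modified line that ignores cookies.
    """
    if require_no_cookies and cookie_count != 0:
        return False
    return bool(CRAWLER_PATTERN.search(user_agent or ""))
```

A Googlebot request that carries a cookie fails the stock check but passes the modified one, which would explain why some showthread hits were not being redirected.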

Hope this helps some of you guys...

~mike.

#96
10-08-2005, 11:12 PM
falter

Quote:
Originally Posted by Triple_T
HI

Google Spider still only looks at the normal threads in who's online?

Only the Yahoo! Slurp Spider looks at the archives?
Triple_T,
Just for clarity's sake, I was having the same problem you are having. Try my mod (in the post above this one), and see if that helps.

~mike

#97
10-09-2005, 02:00 AM
dutchbb

Quote:
Originally Posted by falter
Triple_T,
Just for clarity's sake, I was having the same problem you are having. Try my mod (in the post above this one), and see if that helps.

~mike
I looked right after and 1 x google was in the archives. After that it was still also in the threads.

I noticed Google is much less effective in comparison:
- only 1 spider most of the time (Yahoo 10 or more)
- Yahoo is now always in the archives, Googlebot almost always not
- Googlebot still goes to pages like printthread and member.php, even with a robots.txt disallowing that.

MSN bot has not gone further than index.php, so it looks like Yahoo is just a better bot?

Now I have 2 questions regarding robots.txt:

- I have one in both the site root and the vBulletin root. Is this needed? If not, what is the correct place (from what I have read it should be the site root)?

- Is the .php extension needed for disallowing files? Some say it's best not to include it; I have not seen a difference so far.

#98
10-09-2005, 03:34 AM
jdingman

Looks great so far. One question about mod_rewrite

using
Quote:
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index.php$ / [R=301,L]
that redirects if you're using forums.domain.com. What about if you're using domain.com/forums/? What mod_rewrite would you use for that redirect?

(not exactly for me because I can probably get it working, but anyone else that might need this as well.)

#99
10-09-2005, 03:51 AM
falter

Quote:
Originally Posted by Triple_T
Now I have 2 questions regarding robots.txt:

- I have one in both the site root and the vBulletin root. Is this needed? If not, what is the correct place (from what I have read it should be the site root)?

- Is the .php extension needed for disallowing files? Some say it's best not to include it; I have not seen a difference so far.
your robots.txt should be accessible at the root of your domain (http://www.mydomain.com/robots.txt). this is the only place that spiders know to check.

if you're trying to explicitly define specific files (ex. /forums/showthread.php), then you should define that entry in your robots.txt file. there's no point in not putting the ".php" at the end (ex. /forums/showthread), it doesn't buy you anything. it can actually have a negative impact if your entries aren't defined well. say you're trying to tell search engines to ignore "/forum/s.php" (this is just hypothetical). if you were to just put "/forum/s" in your robots.txt, then, in addition to blocking "/forum/s.php", you'd be blocking "/forum/showthread.php", "/forum/search.php", "/forum/showgroups.php", anything else where the url starts with "/forum/s" .... as you can see, it's important to be as specific as possible, otherwise you risk shutting spiders out of huge chunks of your site.
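
The prefix matching described above is easy to check with Python's standard urllib.robotparser module. This is just a quick sketch; the two rule sets and the helper name are made up to match the hypothetical "/forum/s" example.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path, agent="Googlebot"):
    """Return True if the robots.txt text in `rules` lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


# Too broad: every path that starts with "/forum/s" is blocked.
broad = "User-agent: *\nDisallow: /forum/s"

# Specific: only paths that start with "/forum/s.php" are blocked.
specific = "User-agent: *\nDisallow: /forum/s.php"
```

With the broad rule, allowed(broad, "/forum/showthread.php") is False, so spiders are shut out of the threads. With the specific rule the same call returns True and only /forum/s.php stays blocked.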

#100
10-09-2005, 03:59 AM
falter

Quote:
Originally Posted by Triple_T
I looked right after and 1 x google was in the archives. After that it was still also in the threads.

I noticed Google is much less effective in comparison:
- only 1 spider most of the time (Yahoo 10 or more)
- Yahoo is now always in the archives, Googlebot almost always not
- Googlebot still goes to pages like printthread and member.php, even with a robots.txt disallowing that.
i've thought about it some more.
301 code just tells the bot that the link has permanently moved. it would take a second request from the spider to actually jump to the archives. if the spider is slow (as googlebot and msnbot typically are), i can see how it would appear as though googlebot was sitting in showthread, instead of being directed to the archive....

#101
10-09-2005, 05:42 AM
lierduh

I have a new version ready to be released. If anyone wants, you can download this and try out before I put together the package.

I still need to do the documentation for the modifications of index.php and global.php files.