vb.org Archive

vb.org Archive (https://vborg.vbsupport.ru/index.php)
-   vBulletin 2.x Full Releases (https://vborg.vbsupport.ru/forumdisplay.php?f=4)
-   -   vbSpiderFriend - Search Engine Friendliness (https://vborg.vbsupport.ru/showthread.php?t=15628)

Brian 02-21-2002 12:05 AM

Hello,

I am writing to see if someone would be interested in taking this hack one step further, so that it creates static HTML pages for search engines to index. This would prevent search engines from overloading a server by trying to index the dynamic content too fast.

What I propose is a script that makes pages identical to the ones this hack produces, but that writes them out as actual HTML files mirroring the dynamic pages and puts them into a similar folder structure.

The script would need to cycle through the existing posts in stages initially, so it doesn't cause problems by doing them all at once. It should then be run via a cron job, or manually every so often, to archive new posts and re-archive posts edited since the last run.
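The staged, incremental run described above could be sketched roughly as follows. This is only an illustration in Python, not part of any existing hack: the `Post` record, its `last_modified` field, and both function names are assumptions made up for the example.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Post(NamedTuple):
    """Hypothetical record; a real board would read these from its database."""
    post_id: int
    last_modified: int  # Unix timestamp of creation or last edit (assumed field)


def posts_to_archive(posts: Iterable[Post], last_run: int) -> List[Post]:
    """Return posts created or edited after the previous archive run."""
    return [p for p in posts if p.last_modified > last_run]


def in_batches(items: List[Post], batch_size: int) -> Iterator[List[Post]]:
    """Yield fixed-size slices so a single run never handles every post at once."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

A cron job could call `posts_to_archive` with the timestamp of the previous run and then work through the result one batch at a time, pausing between batches if the server is busy.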

I feel this would meet a lot of the needs of sites on shared servers, and if need be I would be willing to pay for this to be done.

Let me know if anyone is interested.

-Brian

JJR512 02-21-2002 01:15 AM

Quote:

Originally posted by eva2000
had to remove this hack as it allowed people to snoop in to private forums via entering forum id numbers which were not displayed (invisible) on the page listings :(
This doesn't happen for me. I just logged out of my board and tried putting in some private forum ID numbers. In all cases, even though the URL in the address bar showed the private forumid, the page actually went to forumid 1. Regardless of where I started, if I put in a private forumid number, it went to the first forum.

buro9 02-22-2002 07:06 AM

Quote:

Originally posted by Brian
Hello,

I am writing to see if someone would be interested in taking this hack one step further, so that it creates static HTML pages for search engines to index. This would prevent search engines from overloading a server by trying to index the dynamic content too fast.

What I propose is a script that makes pages identical to the ones this hack produces, but that writes them out as actual HTML files mirroring the dynamic pages and puts them into a similar folder structure.

The script would need to cycle through the existing posts in stages initially, so it doesn't cause problems by doing them all at once. It should then be run via a cron job, or manually every so often, to archive new posts and re-archive posts edited since the last run.

I feel this would meet a lot of the needs of sites on shared servers, and if need be I would be willing to pay for this to be done.

Let me know if anyone is interested.

-Brian

I already have what we've called a Cache Cannon on one of our sites.
All it does is whip through the database and, for every result of a particular search query, cannon hundreds of small files onto the docroot.

Now, we use this to pre-generate content on our site, thus massively reducing the database hits for the dynamic content (very important for us, we get over 100,000 unique users per hour on our top content sites).

Once a day it is fired and the site is made fresh. News is fired hourly or manually when needed.

In our application it's good for security too, since the database resides on a different machine and no access is needed by the webserver (the Cache Cannon resides on an interim machine and simply copies the files to the web server).

...anyhow, yesterday when I saw this thread I realised that the main flaw is that it is too slow in generating content for the spiders, that the spiders would prefer static HTML so they can trawl faster, and that the pages were not optimised enough for a high search engine ranking. Also, you probably get hit by several spiders a day (take a peek at your logs for requests for /robots.txt for an indication), and the work to pre-generate is probably less than the work to serve it all each time.

Thus I will probably be making another implementation of this hack, 100% new but based upon our existing Cache Cannon theory.

It will create a single html file for each post, and you could fire it for given date ranges (reducing server load) or forums, at given time intervals (manually, say weekly) or via a cron job.

I shall also include client-side JavaScript in these files to redirect a user to the proper version of the post in the appropriate forum on load. This should be Googlebot-safe, as I believe it ignores client-side script, but it will ensure that when users come from a search engine they are simply bounced to the correct entry in the real forum.
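A static page with that kind of on-load redirect might be generated along these lines. This is a minimal Python sketch under stated assumptions: `render_post_page`, its parameters, the page layout, and the example URL are all hypothetical, and the actual Cache Cannon uses its own template.

```python
import html
import json


def render_post_page(title: str, body_text: str, live_url: str) -> str:
    """Build a standalone HTML page that redirects script-enabled browsers."""
    safe_title = html.escape(title)
    safe_body = html.escape(body_text)
    # json.dumps produces a quoted JavaScript string literal; escaping "<"
    # keeps a stray "</script>" in the URL from ending the script block early.
    js_url = json.dumps(live_url).replace("<", "\\u003c")
    return (
        "<html>\n<head>\n"
        f"<title>{safe_title}</title>\n"
        "<script type=\"text/javascript\">\n"
        f"window.onload = function () {{ window.location.replace({js_url}); }};\n"
        "</script>\n"
        "</head>\n<body>\n"
        f"<h1>{safe_title}</h1>\n<p>{safe_body}</p>\n"
        # Plain link as a fallback for visitors without JavaScript.
        f"<p><a href=\"{html.escape(live_url)}\">View this post in the forum</a></p>\n"
        "</body>\n</html>\n"
    )
```

A client that does not run scripts sees only the static title, text and link, while a browser with JavaScript enabled is sent on to `live_url` as soon as the page loads.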

eva2000, I shall endeavour to make sure that this does not generate files for private forums. This should be perfect for you, since entering private forum IDs would not be possible when the files are static. It should be noted, though, that because this will generate static files, should you later turn a public forum private you would have to delete those files manually; hence including the $forumId in the proposed folder structure.
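Excluding private forums at query time could look something like the sketch below. It is written in Python against an in-memory SQLite database purely for illustration: the simplified tables and the `is_private` flag are assumptions, and the real vBulletin schema (which handles forum permissions differently) would need its own query.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE forum (forumid INTEGER PRIMARY KEY, is_private INTEGER);
        CREATE TABLE thread (threadid INTEGER PRIMARY KEY, forumid INTEGER);
        CREATE TABLE post (postid INTEGER PRIMARY KEY, threadid INTEGER);
        INSERT INTO forum VALUES (1, 0), (2, 1);
        INSERT INTO thread VALUES (100, 1), (200, 2);
        INSERT INTO post VALUES (10, 100), (11, 100), (20, 200);
        """
    )
    return conn


def public_post_ids(conn: sqlite3.Connection) -> list:
    """Return IDs of posts whose forum is not flagged private."""
    rows = conn.execute(
        """
        SELECT post.postid
        FROM post
        JOIN thread ON thread.threadid = post.threadid
        JOIN forum ON forum.forumid = thread.forumid
        WHERE forum.is_private = 0
        ORDER BY post.postid
        """
    ).fetchall()
    return [row[0] for row in rows]
```

Because the filter is applied when the post list is built, no static file is ever written for a forum that is private at generation time. Files for a forum that is made private afterwards still have to be removed by hand.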

Proposed storage:

The folder structure...

$forumpath/archive/$forumId/$year/$month/$day

For the file names...

$postId.htm
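As a rough illustration, the proposed layout could be computed like this in Python. The function name, the zero-padded month and day, and the example values are assumptions; the proposal above does not say whether padding is used.

```python
from datetime import date
from pathlib import PurePosixPath


def archive_path(forum_path: str, forum_id: int, posted: date, post_id: int) -> PurePosixPath:
    """Map a post to $forumpath/archive/$forumId/$year/$month/$day/$postId.htm."""
    return (
        PurePosixPath(forum_path)
        / "archive"
        / str(forum_id)
        / f"{posted.year:04d}"
        / f"{posted.month:02d}"  # zero-padding is an assumption
        / f"{posted.day:02d}"
        / f"{post_id}.htm"
    )
```

Keeping the forum ID as the first folder under `archive` means every file for one forum sits under a single directory, which makes it simple to delete them if that forum is later made private.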

I shall start this on Tuesday next week, and hope to have it finished by Saturday next week (I'd do it sooner, but it's my birthday and this isn't that important!).

The files will be standalone and I shall develop them with vB v2.2.2, though as I shall only be accessing the user, post, thread and forum tables (I guess... I'll have to look at the schema) this should be backwards compatible with at least 2.x boards. I will only be supporting the latest version at any time, though.

If I run into trouble or need assistance with the schema I shall let you know.

Cheers

David K

http://www.buro9.com/forum/

Brian 02-27-2002 01:03 AM

I just wanted to touch base to see if you had yet worked on this.

-Brian

buro9 02-27-2002 09:18 AM

Started to put the basics in place last night whilst building another PC.

Got reasonably far with the function that will dump files on the docroot. Just subject to load testing for that.

Also built the query that will extract all posts for dumping... have been working on this to make sure it excludes private forums... need to install foxserv at home to test this on a test forum.

I think I shall have the back end fully done over the evenings this week. The front end will be the thing that actually takes time, but I'm hoping that's just gonna be Saturday and not need more work. Problems stem from wanting to decrease server load by breaking the generation into manageable amounts (monthly or by forum).

I'll let you know when I have something substantial, and then we can start a new thread here for discussion whilst it's improved.

I think it's realistic to say it should be ready as a Beta over this weekend, and that a Release version should follow next week once everyone is sure that it does what they want it to (though I'll not be including code to toast bread).

Cheers

David K

Brian 02-27-2002 12:43 PM

Wow, can't wait!

rawnet 03-05-2002 12:30 PM

How did this go, Buro? I'm looking for a solution like this which also works on Win2k (without .htaccess, etc). I did a search for Cache Cannon as well but couldn't find it.

buro9 03-05-2002 02:02 PM

I've e-mailed Brian offline about this, but in essence it's built.

What I have thus far is an adequate interface offering caching for:

All Forums
Specific Forums
Specific Forums + Sub Forums
Within the past x days.

The Cache Cannon then will fire for all applicable posts, and uses a template to render the display in the html files.

The only missing thing is the final parsing through all resultant folders and files, constructing the index.htm files that will tie it all together for the spiders... and I have plans on the best way to do this already.
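One possible way to do that final pass is sketched below in Python. It is only a guess at the approach, since the post does not describe the actual plan: `write_indexes` and the bare-bones list markup are assumptions for illustration.

```python
import os
from html import escape


def write_indexes(archive_root: str) -> int:
    """Write an index.htm into every folder under archive_root; return the count."""
    written = 0
    for folder, subdirs, files in os.walk(archive_root):
        subdirs.sort()  # also makes os.walk visit subfolders in sorted order
        # Link to each subfolder's own index, then to each post file here.
        links = [f"{name}/index.htm" for name in subdirs]
        links += sorted(
            name for name in files
            if name.endswith(".htm") and name != "index.htm"
        )
        items = "\n".join(
            f'<li><a href="{escape(link)}">{escape(link)}</a></li>' for link in links
        )
        with open(os.path.join(folder, "index.htm"), "w", encoding="utf-8") as fh:
            fh.write(f"<html><body>\n<ul>\n{items}\n</ul>\n</body></html>\n")
        written += 1
    return written
```

Starting from the top-level index, a spider can then follow plain links down through forum, year, month and day folders to every post file.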

Awaiting feedback from Brian, but if you wish I can send you an example of the current code tonight and you can offer your comments on how to progress it.

I do not want to release it until it works fully on the backend, I'm not bothered by cosmetic things at the moment (since that will be template driven and user adjustable), just that it all works a dream... if you wish to be a private beta tester and help me push it forward, then get in contact.

Cheers

David K

Brian 03-05-2002 02:22 PM

It all sounds great! Can't wait to test it out :)

If it's ready to test, you can email me at Brian@FutureQuest.net

-Brian

Brian 03-09-2002 12:53 AM

I just wanted to follow up to see if you have a version available for us to download yet.

Thanks,
Brian

