Go Back   vb.org Archive > vBulletin 3 Discussion > vB3 General Discussions
FAQ Community Calendar Today's Posts Search

Reply
 
Thread Tools Display Modes
  #1  
Old 02-11-2006, 11:48 PM
forumdude's Avatar
forumdude forumdude is offline
 
Join Date: Nov 2001
Posts: 50
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default Post archiving proposal

We get around 20-25K posts per day on our board and as a result have over 18 million rows in the post table. Whenever someone opens an old thread with hundreds of pages it does a number on our database server. The filesorts are slowing the whole setup down. As a result of this I've come up with a method to archive posts and would like to outline it and get feedback from other large boards or people who would have insight or advice.


The idea is to have a post table that has about the last 6 months of posts from active threads. Once a thread hasn't been posted in for over 6 months and it's not a sticky then its posts gets archived and the thread marked as such. Archived threads cannot be altered once they're archived.

Here's what I've got so far:


DATABASE

Table Name
post: a live table with somewhere around the last 6 months of posts. This is the table that will continue to take all the writes and new information. Some old threads that should stay writeable such as feedback and stickies will continue to be in this table as well.

post_merged: this is the MERGE table that will sit "on top" of the post_x tables listed below.

post_1 - post_n: identically structured tables that will hold around 3.5 million rows each. populated with posts that are grouped by their threadids. no posts from one thread should span across multiple post_x tables.

thread: a column on thread called 'archived' will be added to hold a boolean. this tells showthread where to look for the posts.


The MERGE engine
Pros of MERGE: After doing some testing it would appear that we don't need to specify exactly which table has the postids for the thread. I threw identical queries at the post_x table as well as the post_merge table. Each time I cleared the query cache to get an accurate time. The result times were very similar so we can probably just throw queries at the top 'post_merged' instead of needing to keep record of exactly which table has what thread's posts. This also makes inserting easier as the system won't have to keep track of the last table which is the one that will get the new inserts.

Cons of MERGE: "MERGE tables use more file descriptors. If 10 clients are using a MERGE table that maps to 10 tables, the server uses (10 ? 10) + 10 file descriptors. (10 data file descriptors for each of the 10 clients, and 10 index file descriptors shared among the clients.)." "Key reads are slower. When you read a key, the MERGE storage engine needs to issue a read on all underlying tables to check which one most closely matches the given key. To read the next key, the MERGE storage engine needs to search the read buffers to find the next key. Only when one key buffer is used up does the storage engine need to read the next key block. This makes MERGE keys much slower on eq_ref searches, but not much slower on ref searches." I do not know how badly those will affect this and would appreciate any input from others about it.


FRONT END

showthread.php: verify_id() should return $threadinfo with the 'archived' record from the thread table. If that thread's posts are in the archive then simply change the query to select from 'post_merged' rather than 'post'.

misc other places: disabling all ability to allow someone to write or change information regarding an archived thread/posts. though we should give admins the ability to ressurect threads.


ETC

Initial population: This will probably have to be a shell script (or run php from command line) due to the length of the process (assuming 2 hours on INSERTs). We have to decide how many tables we need to maintain the avg number of posts per each. We need to be able to tell the thread table which posts are going to be in the archive. We need to be able to group posts by threadid as they are entered and not have them span across multiple tables.


Maintenance: A cron script will need to check the post table nightly or weekly and look for threads that have not been posted in during the specified amount of time. If the thread meets the criteria then all of the posts are inserted into post_merge and the records in the 'post' table are deleted. The thread's archived record in the thread table is changed to 1. Put logic in to add another post_x table and have the post_merge table remapped when the last post_x table is over something like 3.5 million rows. May want to prune out archived threadid's from the subscriptions, though this action cannot be reversed.




Any thoughts?
Reply With Quote
  #2  
Old 02-12-2006, 04:39 AM
Nomb's Avatar
Nomb Nomb is offline
 
Join Date: Nov 2002
Location: Phoenix, AZ
Posts: 85
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

I've been thinking about this lately and it's in my project queue to address sometime this year.

I think you have a decent spec outlined, but I think you should consider the archive storage engine. There's a decent article about it here:

http://dev.mysql.com/tech-resources/...ge-engine.html

and you can find the reference manual details here:

http://dev.mysql.com/doc/refman/5.0/...ge-engine.html

At this moment I can't say whether it's a good way to go or not, I need to do more research myself.

Also, I think you should reconsider how you're going to deal with what's archived and what's not. There's really no reason to store ANY data in your live tables about what has been archived. Rather, I think you should simply maintain another domain for your archived data and redirect there. For instance, the logic could go:

Request: http://forum.example.com/showthread.php?t=77

Process: If thread id 77 exists, show thread. Else, 301 redirect:

Request: http://archiveforum.example.com/showthread.php?t=77

Process: If thread id 77 exists, show thread. Else, issue 404.

Although there isn't much of a difference resource wise in your live env, at least you don't have to maintain any additional data there. In essence, your live env is autonomous from your archive env, the live env only sends requests to it when it cannot fulfill them.

Anyway, there is some of my humble feedback and I look forward to participating in this discussion moving forward. I won't necessarily get "head down" on this until later this year, but in the mean time I'd love to be involved and help in any way I can and I applaud you for starting a discussion.

Thanks
Reply With Quote
  #3  
Old 02-12-2006, 08:00 AM
Marco van Herwaarden Marco van Herwaarden is offline
 
Join Date: Jul 2004
Posts: 25,415
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

The biggest problem with any archive solution would be how to include it in a search and don't loose performance.
Reply With Quote
  #4  
Old 02-12-2006, 04:25 PM
AWS's Avatar
AWS AWS is offline
 
Join Date: Nov 2001
Location: Joliet, IL
Posts: 235
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

Archiving into the db is not the way to go. The archive should be static html files and dirs. The idea is to take some load off mysql. The only way to do that is move old threads out of it.
In any case something has to be done.
Reply With Quote
  #5  
Old 02-12-2006, 07:15 PM
forumdude's Avatar
forumdude forumdude is offline
 
Join Date: Nov 2001
Posts: 50
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

Quote:
Originally Posted by MarcoH64
The biggest problem with any archive solution would be how to include it in a search and don't loose performance.
Currently we only allow searches on the past 4 months, which is a separate table on a slave database machine used only for searching. In the scenario above we'd have to rely on other search engines. It sucks but it isn't any worse than what we're already doing, heh.




Also, I forgot to mention in the original post that one of the reasons I want to keep the data in the database itself is to be able to easily revert what I've done should vBulletin ever make their own archival method. With this setup you simply throw all of the archive tables back into the post.
Reply With Quote
Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT. The time now is 08:36 PM.


Powered by vBulletin® Version 3.8.12 by vBS
Copyright ©2000 - 2025, vBulletin Solutions Inc.
X vBulletin 3.8.12 by vBS Debug Information
  • Page Generation 0.03469 seconds
  • Memory Usage 2,203KB
  • Queries Executed 11 (?)
More Information
Template Usage:
  • (1)SHOWTHREAD
  • (1)ad_footer_end
  • (1)ad_footer_start
  • (1)ad_header_end
  • (1)ad_header_logo
  • (1)ad_navbar_below
  • (1)ad_showthread_beforeqr
  • (1)ad_showthread_firstpost
  • (1)ad_showthread_firstpost_sig
  • (1)ad_showthread_firstpost_start
  • (1)bbcode_quote
  • (1)footer
  • (1)forumjump
  • (1)forumrules
  • (1)gobutton
  • (1)header
  • (1)headinclude
  • (1)navbar
  • (3)navbar_link
  • (120)option
  • (5)post_thanks_box
  • (5)post_thanks_button
  • (1)post_thanks_javascript
  • (1)post_thanks_navbar_search
  • (5)post_thanks_postbit_info
  • (5)postbit
  • (5)postbit_onlinestatus
  • (5)postbit_wrapper
  • (1)spacer_close
  • (1)spacer_open
  • (1)tagbit_wrapper 

Phrase Groups Available:
  • global
  • inlinemod
  • postbit
  • posting
  • reputationlevel
  • showthread
Included Files:
  • ./showthread.php
  • ./global.php
  • ./includes/init.php
  • ./includes/class_core.php
  • ./includes/config.php
  • ./includes/functions.php
  • ./includes/class_hook.php
  • ./includes/modsystem_functions.php
  • ./includes/functions_bigthree.php
  • ./includes/class_postbit.php
  • ./includes/class_bbcode.php
  • ./includes/functions_reputation.php
  • ./includes/functions_post_thanks.php 

Hooks Called:
  • init_startup
  • init_startup_session_setup_start
  • init_startup_session_setup_complete
  • cache_permissions
  • fetch_threadinfo_query
  • fetch_threadinfo
  • fetch_foruminfo
  • style_fetch
  • cache_templates
  • global_start
  • parse_templates
  • global_setup_complete
  • showthread_start
  • showthread_getinfo
  • forumjump
  • showthread_post_start
  • showthread_query_postids
  • showthread_query
  • bbcode_fetch_tags
  • bbcode_create
  • showthread_postbit_create
  • postbit_factory
  • postbit_display_start
  • post_thanks_function_post_thanks_off_start
  • post_thanks_function_post_thanks_off_end
  • post_thanks_function_fetch_thanks_start
  • post_thanks_function_fetch_thanks_end
  • post_thanks_function_thanked_already_start
  • post_thanks_function_thanked_already_end
  • fetch_musername
  • postbit_imicons
  • bbcode_parse_start
  • bbcode_parse_complete_precache
  • bbcode_parse_complete
  • postbit_display_complete
  • post_thanks_function_can_thank_this_post_start
  • tag_fetchbit_complete
  • forumrules
  • navbits
  • navbits_complete
  • showthread_complete