Old 09-22-2002, 04:00 AM
sparky2
 
Join Date: Jul 2002
Posts: 31
something "ain't right" with vBulletin's searchindex code

I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries"
but felt compelled to also post it here.



I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else which might be a/the limiting factor... and it all
comes back to (down to) continual I/O bottlenecks from MySQL
threads reading vBulletin's halfGig+ searchindex.MYD and related
tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!


I'm currently rebuilding the search index (for the nth time).
After emptying the old index (via the vBB admin CP interface)
and kicking off the rebuild, I watched the WORD table begin
filling in real time ~~ looking for words I might want to add
to the badwords array.

Hooboy! I went away for 12 hours and came back to find 560,000 records in the WORD table (with only about a third of our 2 million posts indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers.

I paged through the first 10,000 (or so) entries in the WORD table
e.g. SELECT title FROM word where wordid>1000 and wordid<2000;
and they were all CRAP ~~ non-words, with leading "punctuation" chars
( $something, &something, %something, ...something, *something, 45bucks, and even a.m )
!!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!

mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, f'rinstance searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title           |
+--------+-----------------+
|   2737 | $$              |
|   5639 | $$$             |
|  14701 | $$$$            |
|  14613 | $$$$$           |
{ snip }
| 215960 | $$$$$$$4        |
| 568393 | $$$$$$.tia      |
| 219585 | $$$$$$s         |
| 571997 | $$$$$...help    |
| 194704 | $$$$$110.00     |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times! Viz.:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
|          2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode()-ed into words ~~ and punctuation stripped ~~ at runtime, CHARS OUTSIDE [A-Za-z0-9] ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!
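
To make the complaint concrete, here is a rough sketch (in Python, NOT vBulletin's actual PHP) of the kind of index-time cleanup I'd expect. Everything in it is my own assumption: the function name index_words, the rule that a "word" is a run of ASCII letters/digits, the 3-character minimum, and the skipping of pure numbers. Only the badwords idea comes from vBulletin itself.

```python
import re

# Assumption: a "word" is a run of ASCII letters and digits;
# every other character is treated as a separator.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def index_words(post_text, badwords=frozenset(), min_len=3):
    """Return the distinct lowercase words worth indexing for one post.

    Hypothetical helper for illustration only; min_len and the
    pure-number rule are assumptions, not vBulletin settings.
    """
    words = set()
    for token in TOKEN_RE.findall(post_text.lower()):
        if len(token) < min_len:
            continue  # too short to be a useful search term
        if token.isdigit():
            continue  # pure numbers such as "110" are skipped
        if token in badwords:
            continue  # admin-defined stop words
        words.add(token)
    return words


sample = "$$$$$...help needed a.m. *something for 45bucks, the $$ is $110.00 tia"
print(sorted(index_words(sample, badwords={"the", "for"})))
# ['45bucks', 'help', 'needed', 'something', 'tia']
```

With a filter along these lines, "$$", "$$$$$...help" and friends would never become rows in WORD at all, so they couldn't be referenced thousands of times in searchindex either.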

The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialchars(searchTerm)) just needs to be called later, or the result stripslashes()-ed before being checked? Whatever.
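
Here's a small Python sketch of why the ORDER of escaping vs. word-splitting could matter. This is a guess at the mechanism, not a trace of vBulletin's code: Python's html.escape is only standing in for PHP's htmlspecialchars, and the helper name alnum_tokens is made up for the example.

```python
import html
import re

# Assumption: same "letters and digits only" word rule as a cleanup regex would use.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


def alnum_tokens(text):
    """Hypothetical helper: lowercase the text and keep only alphanumeric runs."""
    return TOKEN_RE.findall(text.lower())


raw = 'AT&T said "hi"'
escaped = html.escape(raw)  # stand-in for htmlspecialchars()

print(escaped)                # AT&amp;T said &quot;hi&quot;
print(escaped.split())        # ['AT&amp;T', 'said', '&quot;hi&quot;']
print(alnum_tokens(raw))      # ['at', 't', 'said', 'hi']
print(alnum_tokens(escaped))  # ['at', 'amp', 't', 'said', 'quot', 'hi', 'quot']
```

If the text is escaped BEFORE it is split, a plain whitespace split keeps entity-laden junk like "&quot;hi&quot;", and even a strict alphanumeric split picks up "amp" and "quot" as bogus words. Splitting the raw text first avoids both.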

PLEASE don't put off fixing this until v3.0
This problem has kept many of us chasing our tails for months!

bottom line:
The scalability of the vBulletin platform DEPENDS on overcoming this problem!