vb.org Archive - View Single Post - something "ain't right" with vBulletin's searchindex code

sparky2 · #1 09-22-2002, 04:00 AM

I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries"
but felt compelled to also post it here.

I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else which might be a/the limiting factor... and it all
comes back to (down to) continual I/O bottlenecks from MySQL
threads reading vBulletin's halfGig+ searchindex.MYD and related
tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!

I'm currently rebuilding the search index (for the nth time).
After emptying the prior (via the vbb adminCP interface)
and before rebuilding, I watched the WORD table begin
filling realtime ~~ looking for words I might want to define
as badwords array elements.

Hooboy! I went away for 12hrs and came back to find 560,000 records in the WORD table (and that about a third of our 2million posts had been indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers.

I paged through the first 10,000 (or so) entries in the WORD table
e.g. SELECT title FROM word where wordid>1000 and wordid<2000;
and they were all CRAP ~~ non-words, with leading "punctuation" chars
( $something, &something, %something, ...something, *something, 45bucks,and even a.m )
!!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!

mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, f'rinstance searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title |
+--------+-----------------+
| 2737 | $$ |
| 5639 | $$$ |
| 14701 | $$$$ |
| 14613 | $$$$$ |
{ snip }
| 215960 | $$$$$$$4 |
| 568393 | $$$$$$.tia |
| 219585 | $$$$$$s |
| 571997 | $$$$$...help |
| 194704 | $$$$$110.00 |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times!, vis:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
| 2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode(ed) into words ~~ and punctuation stripped ~~ at runtime, NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!

The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialcharacters(searchTerm) just needs to be called later, or stripslashes-ed before being checked? Whatever.

PLEASE don't put off fixing this until v3.0
This problem has kept many of us chasing our tails for months!

bottom line:
The scalability of the vBulletin platform DEPENDS on overcoming this problem!

X vBulletin 3.8.12 by vBS Debug Information
Page Generation 0.01038 seconds Memory Usage 1,780KB Queries Executed 11 (?)
More Information
Template Usage: (1)SHOWTHREAD_SHOWPOST (1)ad_footer_end (1)ad_footer_start (1)ad_header_end (1)ad_header_logo (1)ad_navbar_below (1)footer (1)gobutton (1)header (1)headinclude (6)option (1)post_thanks_box (1)post_thanks_button (1)post_thanks_javascript (1)post_thanks_navbar_search (1)post_thanks_postbit_info (1)postbit (1)postbit_onlinestatus (1)postbit_wrapper (1)spacer_close (1)spacer_open Phrase Groups Available: global postbit reputationlevel showthread	Included Files: ./showpost.php ./global.php ./includes/init.php ./includes/class_core.php ./includes/config.php ./includes/functions.php ./includes/class_hook.php ./includes/modsystem_functions.php ./includes/functions_bigthree.php ./includes/class_postbit.php ./includes/class_bbcode.php ./includes/functions_reputation.php ./includes/functions_post_thanks.php Hooks Called: init_startup init_startup_session_setup_start init_startup_session_setup_complete cache_permissions fetch_postinfo_query fetch_postinfo fetch_threadinfo_query fetch_threadinfo fetch_foruminfo style_fetch cache_templates global_start parse_templates global_setup_complete showpost_start bbcode_fetch_tags bbcode_create postbit_factory showpost_post postbit_display_start post_thanks_function_post_thanks_off_start post_thanks_function_post_thanks_off_end post_thanks_function_fetch_thanks_start post_thanks_function_fetch_thanks_end post_thanks_function_thanked_already_start post_thanks_function_thanked_already_end fetch_musername postbit_imicons bbcode_parse_start bbcode_parse_complete_precache bbcode_parse_complete postbit_display_complete post_thanks_function_can_thank_this_post_start showpost_complete
Messages: