Go Back   vb.org Archive > Community Central > vBulletin.org Site Feedback
FAQ Community Calendar Today's Posts Search

Reply
 
Thread Tools Display Modes
  #1  
Old 09-22-2002, 04:00 AM
sparky2 sparky2 is offline
 
Join Date: Jul 2002
Posts: 31
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default something "ain't right" with vBulletin's searchindex code

I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries"
but felt compelled to also post it here.



I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else which might be a/the limiting factor... and it all
comes back to (down to) continual I/O bottlenecks from MySQL
threads reading vBulletin's halfGig+ searchindex.MYD and related
tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!


I'm currently rebuilding the search index (for the nth time).
After emptying the prior (via the vbb adminCP interface)
and before rebuilding, I watched the WORD table begin
filling realtime ~~ looking for words I might want to define
as badwords array elements.

Hooboy! I went away for 12hrs and came back to find 560,000 records in the WORD table (and that about a third of our 2million posts had been indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers.

I paged through the first 10,000 (or so) entries in the WORD table
e.g. SELECT title FROM word where wordid>1000 and wordid<2000;
and they were all CRAP ~~ non-words, with leading "punctuation" chars
( $something, &something, %something, ...something, *something, 45bucks,and even a.m )
!!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!

mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, f'rinstance searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title |
+--------+-----------------+
| 2737 | $$ |
| 5639 | $$$ |
| 14701 | $$$$ |
| 14613 | $$$$$ |
{ snip }
| 215960 | $$$$$$$4 |
| 568393 | $$$$$$.tia |
| 219585 | $$$$$$s |
| 571997 | $$$$$...help |
| 194704 | $$$$$110.00 |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times!, vis:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
| 2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode(ed) into words ~~ and punctuation stripped ~~ at runtime, NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!

The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialcharacters(searchTerm) just needs to be called later, or stripslashes-ed before being checked? Whatever.

PLEASE don't put off fixing this until v3.0
This problem has kept many of us chasing our tails for months!

bottom line:
The scalability of the vBulletin platform DEPENDS on overcoming this problem!
Reply With Quote
  #2  
Old 09-22-2002, 06:22 PM
Logician's Avatar
Logician Logician is offline
 
Join Date: Nov 2001
Location: inside vb code
Posts: 4,449
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

Quote:
Originally posted by sparky2
NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED
You're missing the point that the script is not coded to work in English language only. If you strip posts with regex's pattern modifiers (like [A-z][0-9] as you mentioned), the searching function might be useless for many non-english boards.

Besides its not always words someone might want to search in the board. Eg. In this board I would like to make a search for keyword "$bbuserinfo[usergroupid]". If you strip $ sign, I wouldnt find any results. Same applies to many other chars if board subject might be "maths" (+=-/.,), "programming" ($()[]{}&#;'!"), "finance" ($?%) etc. So IMO vb's existing structure is logical.

I didnt check it out but if * is not stripped either, you are right about this char though. It can never be matched as it's the wildcard char but I would say this is not an very important issue anyway, dont you think?
Reply With Quote
  #3  
Old 09-23-2002, 01:35 PM
sparky2 sparky2 is offline
 
Join Date: Jul 2002
Posts: 31
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

Logician, your point about other chars / other languages is well-taken.

If you page through the contents of your WORD table though, using queries like this:

SELECT title FROM word where wordid>1000 and wordid<2000;

you would (anyone would!) have to agree that a large portion of the table's records contain "non-searchword" strings. The purpose of the table is to store WORDS, not "junk".
Reply With Quote
  #4  
Old 09-23-2002, 02:29 PM
Logician's Avatar
Logician Logician is offline
 
Join Date: Nov 2001
Location: inside vb code
Posts: 4,449
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

Quote:
Originally posted by sparky2
If you page through the contents of your WORD table though, using queries like this:

SELECT title FROM word where wordid>1000 and wordid<2000;

you would (anyone would!) have to agree that a large portion of the table's records contain "non-searchword" strings.
Well, I have gone through 0 to 10000 and my table is perfectly ok as you can see in the attachment..

Quote:
The purpose of the table is to store WORDS, not "junk".
I would say "words" is not the correct term.. The purpose of the table is to store "anything" in a post which can be used in a search string by a member in the future. And I believe "anything" should include every char, apart from * and <space>. Otherwise some boards specializing in different subjects as I mentioned in my previous message wouldnt be supported.

So I believe this structure of vb is right and if I were the developer of vb, I would code it in the same way. But if you are saying that this structure can be enhanced in a way that Admins would be able to restrict this behaviour according to their board's language, subject and charactaristics and set some custom striping rules, yes I would agree this would be a good enhancement.

And if you are looking for this enhancement (hack), you are in the right site either.
Attached Images
File Type: jpg select.jpg (127.4 KB, 0 views)
Reply With Quote
  #5  
Old 09-23-2002, 06:10 PM
sparky2 sparky2 is offline
 
Join Date: Jul 2002
Posts: 31
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

there is a wordsonly() function in ~admin/functions.php
which looks like it's designed to do EXACTLY what I'm suggesting/describing. However, I can't find any place in any of the vBulletin code that calls the function.

Considering that lines comprising the "wordsonly" function are immediately prior to the lines containing the "indexpost()" code,
I'm wondering if its inclusion was accidently dropped/forgotten.

I diffed our functions.php against a distribution copy to make sure the file has't been edited here. It hasn't.

============
Above, I said "exactly".
That's not quite correct.

In the line that reads:
$text=ereg_replace("[^ 0-9a-z]"," ",$text);

a space character is used for the replacement.
I'm thinking either of these would be more appropriate:

-- replace any non-alphanumeric char with an EMPTY
(strip, instead of replacing with space)

-=or=-

-- use a temp var, and use an "if" statement
{{{ if a non-alphanumeric char found middle-of-word, do NOT perform $text += $tempvar }}}
(perhaps we should allow a hyphen, or minus, char)
to weed out the "junk" words.


=========== Oh! Lookie what I just found! ===========
http://www.vbulletin.com/forum/showt...threadid=11710

So... this function *WAS* dropped...

...and people have been complaining about the slowness,
system load, and the HUGE size of the searchindex ever since, haven't they?
Reply With Quote
  #6  
Old 09-23-2002, 06:36 PM
sparky2 sparky2 is offline
 
Join Date: Jul 2002
Posts: 31
Благодарил(а): 0 раз(а)
Поблагодарили: 0 раз(а) в 0 сообщениях
Default

(sigh)
(*yes, I'm replying to myself.)

Dropping this function, and opting to include a user-defined "badwords" array
probably seemed (to the developers, at the time) as a necessary
step in allowing for multilingual installations.

Wouldn't maintenance of the "badwords" entries be better handled the way IPban BanEmailDomain entries are handled?
(Entries are stored space-delimited, as a looong string, which resides in the settings table) e.g.:
vbb.setting
settingid = 102
title = "Email addresses to Ban"
varname = banemail
optioncode = textarea

and are editable via the AdminCP}}vBulletin Settings interface
Reply With Quote
Reply


Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

Forum Jump


All times are GMT. The time now is 04:32 PM.


Powered by vBulletin® Version 3.8.12 by vBS
Copyright ©2000 - 2024, vBulletin Solutions Inc.
X vBulletin 3.8.12 by vBS Debug Information
  • Page Generation 0.03804 seconds
  • Memory Usage 2,243KB
  • Queries Executed 14 (?)
More Information
Template Usage:
  • (1)SHOWTHREAD
  • (1)ad_footer_end
  • (1)ad_footer_start
  • (1)ad_header_end
  • (1)ad_header_logo
  • (1)ad_navbar_below
  • (1)ad_showthread_beforeqr
  • (1)ad_showthread_firstpost
  • (1)ad_showthread_firstpost_sig
  • (1)ad_showthread_firstpost_start
  • (3)bbcode_quote
  • (1)footer
  • (1)forumjump
  • (1)forumrules
  • (1)gobutton
  • (1)header
  • (1)headinclude
  • (1)navbar
  • (3)navbar_link
  • (120)option
  • (6)post_thanks_box
  • (6)post_thanks_button
  • (1)post_thanks_javascript
  • (1)post_thanks_navbar_search
  • (6)post_thanks_postbit_info
  • (6)postbit
  • (1)postbit_attachment
  • (6)postbit_onlinestatus
  • (6)postbit_wrapper
  • (1)spacer_close
  • (1)spacer_open
  • (1)tagbit_wrapper 

Phrase Groups Available:
  • global
  • inlinemod
  • postbit
  • posting
  • reputationlevel
  • showthread
Included Files:
  • ./showthread.php
  • ./global.php
  • ./includes/init.php
  • ./includes/class_core.php
  • ./includes/config.php
  • ./includes/functions.php
  • ./includes/class_hook.php
  • ./includes/modsystem_functions.php
  • ./includes/functions_bigthree.php
  • ./includes/class_postbit.php
  • ./includes/class_bbcode.php
  • ./includes/functions_reputation.php
  • ./includes/functions_post_thanks.php 

Hooks Called:
  • init_startup
  • init_startup_session_setup_start
  • init_startup_session_setup_complete
  • cache_permissions
  • fetch_postinfo_query
  • fetch_postinfo
  • fetch_threadinfo_query
  • fetch_threadinfo
  • fetch_foruminfo
  • style_fetch
  • cache_templates
  • global_start
  • parse_templates
  • global_setup_complete
  • showthread_start
  • showthread_getinfo
  • forumjump
  • showthread_post_start
  • showthread_query_postids
  • showthread_query
  • bbcode_fetch_tags
  • bbcode_create
  • showthread_postbit_create
  • postbit_factory
  • postbit_display_start
  • post_thanks_function_post_thanks_off_start
  • post_thanks_function_post_thanks_off_end
  • post_thanks_function_fetch_thanks_start
  • post_thanks_function_fetch_thanks_end
  • post_thanks_function_thanked_already_start
  • post_thanks_function_thanked_already_end
  • fetch_musername
  • postbit_imicons
  • bbcode_parse_start
  • bbcode_parse_complete_precache
  • bbcode_parse_complete
  • postbit_display_complete
  • post_thanks_function_can_thank_this_post_start
  • postbit_attachment
  • tag_fetchbit_complete
  • forumrules
  • navbits
  • navbits_complete
  • showthread_complete