The Arcive of Official vBulletin Modifications Site.It is not a VB3 engine, just a parsed copy! |
|
#1
|
|||
|
|||
something "ain't right" with vBulletin's searchindex code
I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries" but felt compelled to also post it here. I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof": During the past 2 months, I've tweaked our MySQL conf, php.ini, and Apache httpd.conf countless times, trying to eliminate anything else which might be a/the limiting factor... and it all comes back to (down to) continual I/O bottlenecks from MySQL threads reading vBulletin's halfGig+ searchindex.MYD and related tables, with everything queueing up behind them. THE SEARCHINDEX SHOULD NOT BE THIS LARGE. -=- SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX! I'm currently rebuilding the search index (for the nth time). After emptying the prior (via the vbb adminCP interface) and before rebuilding, I watched the WORD table begin filling realtime ~~ looking for words I might want to define as badwords array elements. Hooboy! I went away for 12hrs and came back to find 560,000 records in the WORD table (and that about a third of our 2million posts had been indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers. I paged through the first 10,000 (or so) entries in the WORD table e.g. SELECT title FROM word where wordid>1000 and wordid<2000; and they were all CRAP ~~ non-words, with leading "punctuation" chars ( $something, &something, %something, ...something, *something, 45bucks,and even a.m ) !!! "Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered. WRONG! mysql> select distinct wordid from searchindex order by wordid limit 500; I picked several to check, f'rinstance searchindex.wordid = 2737 and then double-checked (xrefed) them in the WORD table: mysql> select wordid,title from word where title like '$%' limit 500; +--------+-----------------+ | wordid | title | +--------+-----------------+ | 2737 | $$ | | 5639 | $$$ | | 14701 | $$$$ | | 14613 | $$$$$ | { snip } | 215960 | $$$$$$$4 | | 568393 | $$$$$$.tia | | 219585 | $$$$$$s | | 571997 | $$$$$...help | | 194704 | $$$$$110.00 | { snip } Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times!, vis: select count(wordid) from searchindex where wordid=2737; +---------------+ | count(wordid) | +---------------+ | 2477 | +---------------+ What I'm saying (accusing?) is that although search phrases are being explode(ed) into words ~~ and punctuation stripped ~~ at runtime, NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT! The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here. Maybe the REGEX is okay. Maybe addslashes(htmlspecialcharacters(searchTerm) just needs to be called later, or stripslashes-ed before being checked? Whatever. PLEASE don't put off fixing this until v3.0 This problem has kept many of us chasing our tails for months! bottom line: The scalability of the vBulletin platform DEPENDS on overcoming this problem! |
#2
|
||||
|
||||
Quote:
Besides its not always words someone might want to search in the board. Eg. In this board I would like to make a search for keyword "$bbuserinfo[usergroupid]". If you strip $ sign, I wouldnt find any results. Same applies to many other chars if board subject might be "maths" (+=-/.,), "programming" ($()[]{}&#;'!"), "finance" ($?%) etc. So IMO vb's existing structure is logical. I didnt check it out but if * is not stripped either, you are right about this char though. It can never be matched as it's the wildcard char but I would say this is not an very important issue anyway, dont you think? |
#3
|
|||
|
|||
Logician, your point about other chars / other languages is well-taken.
If you page through the contents of your WORD table though, using queries like this: SELECT title FROM word where wordid>1000 and wordid<2000; you would (anyone would!) have to agree that a large portion of the table's records contain "non-searchword" strings. The purpose of the table is to store WORDS, not "junk". |
#4
|
||||
|
||||
Quote:
Quote:
So I believe this structure of vb is right and if I were the developer of vb, I would code it in the same way. But if you are saying that this structure can be enhanced in a way that Admins would be able to restrict this behaviour according to their board's language, subject and charactaristics and set some custom striping rules, yes I would agree this would be a good enhancement. And if you are looking for this enhancement (hack), you are in the right site either. |
#5
|
|||
|
|||
there is a wordsonly() function in ~admin/functions.php
which looks like it's designed to do EXACTLY what I'm suggesting/describing. However, I can't find any place in any of the vBulletin code that calls the function. Considering that lines comprising the "wordsonly" function are immediately prior to the lines containing the "indexpost()" code, I'm wondering if its inclusion was accidently dropped/forgotten. I diffed our functions.php against a distribution copy to make sure the file has't been edited here. It hasn't. ============ Above, I said "exactly". That's not quite correct. In the line that reads: $text=ereg_replace("[^ 0-9a-z]"," ",$text); a space character is used for the replacement. I'm thinking either of these would be more appropriate: -- replace any non-alphanumeric char with an EMPTY (strip, instead of replacing with space) -=or=- -- use a temp var, and use an "if" statement {{{ if a non-alphanumeric char found middle-of-word, do NOT perform $text += $tempvar }}} (perhaps we should allow a hyphen, or minus, char) to weed out the "junk" words. =========== Oh! Lookie what I just found! =========== http://www.vbulletin.com/forum/showt...threadid=11710 So... this function *WAS* dropped... ...and people have been complaining about the slowness, system load, and the HUGE size of the searchindex ever since, haven't they? |
#6
|
|||
|
|||
(sigh)
(*yes, I'm replying to myself.) Dropping this function, and opting to include a user-defined "badwords" array probably seemed (to the developers, at the time) as a necessary step in allowing for multilingual installations. Wouldn't maintenance of the "badwords" entries be better handled the way IPban BanEmailDomain entries are handled? (Entries are stored space-delimited, as a looong string, which resides in the settings table) e.g.: vbb.setting settingid = 102 title = "Email addresses to Ban" varname = banemail optioncode = textarea and are editable via the AdminCP}}vBulletin Settings interface |
Thread Tools | |
Display Modes | |
|
|
X vBulletin 3.8.12 by vBS Debug Information | |
---|---|
|
|
More Information | |
Template Usage:
Phrase Groups Available:
|
Included Files:
Hooks Called:
|