The Archive of Official vBulletin Modifications Site. It is not a VB3 engine, just a parsed copy!
#1
something "ain't right" with vBulletin's searchindex code
I've posted this over at the vbulletin.com site, in a thread titled "search feature: ongoing questions and worries", but felt compelled to also post it here.

I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini, and Apache httpd.conf countless times, trying to eliminate anything else which might be the limiting factor... and it all comes back to continual I/O bottlenecks from MySQL threads reading vBulletin's half-gig-plus searchindex.MYD and related tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.

SOMETHING _MUST_ BE WRONG WITH THE REGEX / PREG LOGIC USED IN BUILDING THE SEARCH INDEX!

I'm currently rebuilding the search index (for the nth time). After emptying the prior index (via the vBulletin admin CP interface), I watched the word table begin filling in real time as the rebuild started, looking for words I might want to add to the badwords array.

Hooboy! I went away for 12 hours and came back to find 560,000 records in the word table (with only about a third of our 2 million posts indexed at that point). "There aren't that many words in the English language, dammit!" I muttered, and started poking for answers.

I paged through the first 10,000 (or so) entries in the word table, e.g.

mysql> SELECT title FROM word WHERE wordid > 1000 AND wordid < 2000;

and they were all CRAP: non-words, mostly with leading "punctuation" chars ( $something, &something, %something, ...something, *something, 45bucks, and even a.m ) !!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered. WRONG!
mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, for instance searchindex.wordid = 2737, and then cross-referenced them in the word table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title           |
+--------+-----------------+
|   2737 | $$              |
|   5639 | $$$             |
|  14701 | $$$$            |
|  14613 | $$$$$           |
{ snip }
| 215960 | $$$$$$$4        |
| 568393 | $$$$$$.tia      |
| 219585 | $$$$$$s         |
| 571997 | $$$$$...help    |
| 194704 | $$$$$110.00     |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times! Viz.:

mysql> select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
|          2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode()-ed into words, and punctuation stripped, at runtime, NON-ALPHANUMERIC CHARS (anything outside [A-Za-z0-9]) ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!

The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialchars(searchTerm)) just needs to be called later, or the result run through stripslashes() before being checked? Whatever.

PLEASE don't put off fixing this until v3.0. This problem has kept many of us chasing our tails for months!

Bottom line: the scalability of the vBulletin platform DEPENDS on overcoming this problem!
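
To make the complaint concrete, here is a minimal sketch of the kind of normalization I'd expect at index-build time. It is written in Python purely for illustration. It is NOT vBulletin's actual PHP code, and the names index_terms, BADWORDS and min_len are hypothetical, as is the three-character minimum. The idea is simply to split each post on anything that isn't a letter or digit before a word ever reaches the word table:

```python
import re

# Hypothetical stop list; the real list (about 900 "badwords") is not shown here.
BADWORDS = {"the", "and", "for"}

# Any run of characters that is not an ASCII letter or digit acts as a separator,
# so "$$$$$...help" contributes only "help" and "$$" contributes nothing.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")


def index_terms(post_text, min_len=3, badwords=BADWORDS):
    """Return the set of normalized words worth indexing for one post."""
    terms = set()
    for token in NON_ALNUM.split(post_text.lower()):
        # Drop empty fragments, very short tokens, and stop words.
        if len(token) < min_len or token in badwords:
            continue
        terms.add(token)
    return terms


if __name__ == "__main__":
    sample = "$$$$$...help $$$$$110.00 a.m &something the"
    print(sorted(index_terms(sample)))
```

With a filter along these lines, entries such as $$ or $$$$$...help would either vanish or collapse into the real word they contain, so the word table could only grow with genuinely distinct words. Note that it would still keep alphanumeric junk like 45bucks; that would need a separate rule.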