I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries"
but felt compelled to also post it here.
I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":
During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else that might be the limiting factor... and it all
comes back to continual I/O bottlenecks from MySQL
threads reading vBulletin's half-gigabyte-plus searchindex.MYD and related
tables, with everything else queueing up behind them.
THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!
I'm currently rebuilding the search index (for the nth time).
After emptying the old one (via the vBulletin admin CP interface)
and kicking off the rebuild, I watched the WORD table
fill in real time ~~ looking for words I might want to add
to the badwords array.
Hooboy! I went away for 12 hours and came back to find 560,000 records in the WORD table, with only about a third of our 2 million posts indexed at that point. "There aren't that many words in the English language, dammit!" (I muttered) and started poking around for answers.
I paged through the first 10,000 (or so) entries in the WORD table, e.g.
SELECT title FROM word WHERE wordid > 1000 AND wordid < 2000;
and they were all CRAP ~~ non-words, mostly with leading "punctuation" chars
($something, &something, %something, ...something, *something), plus things like 45bucks and even a.m !!!
"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!
mysql> select distinct wordid from searchindex order by wordid limit 500;
I picked several to check, f'rinstance
searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:
mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title           |
+--------+-----------------+
|   2737 | $$              |
|   5639 | $$$             |
|  14701 | $$$$            |
|  14613 | $$$$$           |
{ snip }
| 215960 | $$$$$$$4        |
| 568393 | $$$$$$.tia      |
| 219585 | $$$$$$s         |
| 571997 | $$$$$...help    |
| 194704 | $$$$$110.00     |
{ snip }
Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times! Viz:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
|          2477 |
+---------------+
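To put a number on how much of the index is taken up by junk like this, one could dump the word and searchindex tables and tally them offline. What follows is only a rough Python sketch, not anything from vBulletin itself: the junk_share function, the rule that a "real" word is purely ASCII letters and digits, and the toy data are all my own assumptions.

```python
import re
from collections import Counter

# Assumption: a "real" word consists only of ASCII letters and digits.
CLEAN_WORD = re.compile(r"[A-Za-z0-9]+")

def junk_share(words, index_wordids):
    """Return (junk_rows, total_rows) for a dumped index.

    words: dict mapping wordid -> title (stand-in for the word table)
    index_wordids: iterable of wordid values, one per searchindex row
    A wordid missing from `words` is counted as junk.
    """
    hits = Counter(index_wordids)
    total = sum(hits.values())
    junk = sum(n for wordid, n in hits.items()
               if not CLEAN_WORD.fullmatch(words.get(wordid, "")))
    return junk, total

# Toy data for illustration only; not figures from a real board.
words = {1: "search", 2: "$$", 3: "index", 4: "...help"}
index_wordids = [1, 2, 2, 3, 4, 2]
junk, total = junk_share(words, index_wordids)
print(f"{junk} of {total} index rows point at non-words")
```

Note that something like 45bucks would pass as "clean" under this rule, so a real audit would probably want a stricter test.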
What I'm saying (accusing?) is that although search phrases are being explode()d into words ~~ and punctuation stripped ~~ at runtime, NON-ALPHANUMERIC CHARS (anything outside [A-Za-z0-9]) ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!
The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.
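To show what I mean by stripping at index time, here is a minimal Python sketch (NOT vBulletin's actual PHP). It splits a post on anything that isn't a letter or digit and then drops the badwords. The index_words function, the min_len cutoff and the sample post are hypothetical, purely for illustration.

```python
import re

# Assumption: any run of characters other than ASCII letters/digits
# is a separator, so punctuation never survives into a token.
NON_ALNUM = re.compile(r"[^A-Za-z0-9]+")

def index_words(post_text, badwords=frozenset(), min_len=3):
    """Return the set of words a post would contribute to the index.

    min_len is a hypothetical cutoff for very short tokens.
    """
    tokens = NON_ALNUM.split(post_text.lower())
    return {t for t in tokens
            if len(t) >= min_len and t not in badwords}

post = "$$$$$...help me, I paid $110.00 for this!!! TIA"
print(sorted(index_words(post, badwords={"for", "this"})))
# prints ['110', 'help', 'paid', 'tia']
```

If the same function were also run over the search phrase at query time, the index and the queries would at least agree on what counts as a word, and entries like $$ or $$$$$...help would never be stored in the first place.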
Maybe the REGEX is okay. Maybe addslashes(htmlspecialchars(searchTerm)) just needs to be called later, or the result run through stripslashes() before being checked? Whatever.
PLEASE don't put off fixing this until v3.0.
This problem has kept many of us chasing our tails for months!
bottom line:
The scalability of the vBulletin platform DEPENDS on overcoming this problem!