vb.org Archive

vb.org Archive (https://vborg.vbsupport.ru/index.php)
-   vBulletin.org Site Feedback (https://vborg.vbsupport.ru/forumdisplay.php?f=7)
-   -   something "ain't right" with vBulletin's searchindex code (https://vborg.vbsupport.ru/showthread.php?t=43800)

sparky2 09-22-2002 04:00 AM

something "ain't right" with vBulletin's searchindex code
 
I've posted this over at the vbulletin.com site,
in a thread titled "search feature: ongoing questions and worries"
but felt compelled to also post it here.



I'm fairly convinced that something is wrong with the vBulletin search algorithm (as of v2.2.7) and/or the underlying db schema, and offer the following as "proof":

During the past 2 months, I've tweaked our MySQL conf, php.ini,
and Apache httpd.conf countless times, trying to eliminate
anything else which might be a/the limiting factor... and it all
comes back to (down to) continual I/O bottlenecks from MySQL
threads reading vBulletin's halfGig+ searchindex.MYD and related
tables, with everything queueing up behind them.

THE SEARCHINDEX SHOULD NOT BE THIS LARGE.
-=-
SOMETHING _MUST_ BE WRONG WITH THE REGEX // PREG LOGIC USED IN BUILDING THE SEARCH INDEX!


I'm currently rebuilding the search index (for the nth time).
After emptying the prior (via the vbb adminCP interface)
and before rebuilding, I watched the WORD table begin
filling realtime ~~ looking for words I might want to define
as badwords array elements.

Hooboy! I went away for 12hrs and came back to find 560,000 records in the WORD table (and that about a third of our 2million posts had been indexed at that point). "There aren't that many words in the English language, dammit!" (I muttered) and started poking for answers.

I paged through the first 10,000 (or so) entries in the WORD table
e.g. SELECT title FROM word where wordid>1000 and wordid<2000;
and they were all CRAP ~~ non-words, with leading "punctuation" chars
( $something, &something, %something, ...something, *something, 45bucks,and even a.m )
!!!

"Oh, but these won't ALL actually be referenced in the searchindex table, right?!?" I wondered.
WRONG!

mysql> select distinct wordid from searchindex order by wordid limit 500;

I picked several to check, f'rinstance searchindex.wordid = 2737
and then double-checked (xrefed) them in the WORD table:

mysql> select wordid,title from word where title like '$%' limit 500;
+--------+-----------------+
| wordid | title |
+--------+-----------------+
| 2737 | $$ |
| 5639 | $$$ |
| 14701 | $$$$ |
| 14613 | $$$$$ |
{ snip }
| 215960 | $$$$$$$4 |
| 568393 | $$$$$$.tia |
| 219585 | $$$$$$s |
| 571997 | $$$$$...help |
| 194704 | $$$$$110.00 |
{ snip }

Not only is this "non-word" IN THE SEARCHINDEX, it's referenced THOUSANDS of times!, vis:
select count(wordid) from searchindex where wordid=2737;
+---------------+
| count(wordid) |
+---------------+
| 2477 |
+---------------+

What I'm saying (accusing?) is that although search phrases are being explode(ed) into words ~~ and punctuation stripped ~~ at runtime, NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED (ref: vbb.word.title) WHEN THE SEARCH INDEX IS BUILT!

The 900 (or so) "badwords" in our list ARE being excluded, but these exclusions are tip-of-the-iceberg-insignificant in the scheme of things here.

Maybe the REGEX is okay. Maybe addslashes(htmlspecialcharacters(searchTerm) just needs to be called later, or stripslashes-ed before being checked? Whatever.

PLEASE don't put off fixing this until v3.0
This problem has kept many of us chasing our tails for months!

bottom line:
The scalability of the vBulletin platform DEPENDS on overcoming this problem!

Logician 09-22-2002 06:22 PM

Quote:

Originally posted by sparky2
NON [A-z][0-9] CHARS ARE CLEARLY NOT BEING STRIPPED
You're missing the point that the script is not coded to work in English language only. If you strip posts with regex's pattern modifiers (like [A-z][0-9] as you mentioned), the searching function might be useless for many non-english boards.

Besides its not always words someone might want to search in the board. Eg. In this board I would like to make a search for keyword "$bbuserinfo[usergroupid]". If you strip $ sign, I wouldnt find any results. Same applies to many other chars if board subject might be "maths" (+=-/.,), "programming" ($()[]{}&#;'!"), "finance" ($?%) etc. So IMO vb's existing structure is logical.

I didnt check it out but if * is not stripped either, you are right about this char though. It can never be matched as it's the wildcard char but I would say this is not an very important issue anyway, dont you think? :)

sparky2 09-23-2002 01:35 PM

Logician, your point about other chars / other languages is well-taken.

If you page through the contents of your WORD table though, using queries like this:

SELECT title FROM word where wordid>1000 and wordid<2000;

you would (anyone would!) have to agree that a large portion of the table's records contain "non-searchword" strings. The purpose of the table is to store WORDS, not "junk".

Logician 09-23-2002 02:29 PM

1 Attachment(s)
Quote:

Originally posted by sparky2
If you page through the contents of your WORD table though, using queries like this:

SELECT title FROM word where wordid>1000 and wordid<2000;

you would (anyone would!) have to agree that a large portion of the table's records contain "non-searchword" strings.

Well, I have gone through 0 to 10000 and my table is perfectly ok as you can see in the attachment..

Quote:

The purpose of the table is to store WORDS, not "junk".
I would say "words" is not the correct term.. The purpose of the table is to store "anything" in a post which can be used in a search string by a member in the future. And I believe "anything" should include every char, apart from * and <space>. Otherwise some boards specializing in different subjects as I mentioned in my previous message wouldnt be supported.

So I believe this structure of vb is right and if I were the developer of vb, I would code it in the same way. But if you are saying that this structure can be enhanced in a way that Admins would be able to restrict this behaviour according to their board's language, subject and charactaristics and set some custom striping rules, yes I would agree this would be a good enhancement.

And if you are looking for this enhancement (hack), you are in the right site either. :)

sparky2 09-23-2002 06:10 PM

there is a wordsonly() function in ~admin/functions.php
which looks like it's designed to do EXACTLY what I'm suggesting/describing. However, I can't find any place in any of the vBulletin code that calls the function.

Considering that lines comprising the "wordsonly" function are immediately prior to the lines containing the "indexpost()" code,
I'm wondering if its inclusion was accidently dropped/forgotten.

I diffed our functions.php against a distribution copy to make sure the file has't been edited here. It hasn't.

============
Above, I said "exactly".
That's not quite correct.

In the line that reads:
$text=ereg_replace("[^ 0-9a-z]"," ",$text);

a space character is used for the replacement.
I'm thinking either of these would be more appropriate:

-- replace any non-alphanumeric char with an EMPTY
(strip, instead of replacing with space)

-=or=-

-- use a temp var, and use an "if" statement
{{{ if a non-alphanumeric char found middle-of-word, do NOT perform $text += $tempvar }}}
(perhaps we should allow a hyphen, or minus, char)
to weed out the "junk" words.


=========== Oh! Lookie what I just found! ===========
http://www.vbulletin.com/forum/showt...threadid=11710

So... this function *WAS* dropped...

...and people have been complaining about the slowness,
system load, and the HUGE size of the searchindex ever since, haven't they?

sparky2 09-23-2002 06:36 PM

(sigh)
(*yes, I'm replying to myself.)

Dropping this function, and opting to include a user-defined "badwords" array
probably seemed (to the developers, at the time) as a necessary
step in allowing for multilingual installations.

Wouldn't maintenance of the "badwords" entries be better handled the way IPban BanEmailDomain entries are handled?
(Entries are stored space-delimited, as a looong string, which resides in the settings table) e.g.:
vbb.setting
settingid = 102
title = "Email addresses to Ban"
varname = banemail
optioncode = textarea

and are editable via the AdminCP}}vBulletin Settings interface


All times are GMT. The time now is 10:09 AM.

Powered by vBulletin® Version 3.8.12 by vBS
Copyright ©2000 - 2025, vBulletin Solutions Inc.

X vBulletin 3.8.12 by vBS Debug Information
  • Page Generation 0.01148 seconds
  • Memory Usage 1,743KB
  • Queries Executed 10 (?)
More Information
Template Usage:
  • (1)ad_footer_end
  • (1)ad_footer_start
  • (1)ad_header_end
  • (1)ad_header_logo
  • (1)ad_navbar_below
  • (3)bbcode_quote_printable
  • (1)footer
  • (1)gobutton
  • (1)header
  • (1)headinclude
  • (6)option
  • (1)post_thanks_navbar_search
  • (1)printthread
  • (6)printthreadbit
  • (1)spacer_close
  • (1)spacer_open 

Phrase Groups Available:
  • global
  • postbit
  • showthread
Included Files:
  • ./printthread.php
  • ./global.php
  • ./includes/init.php
  • ./includes/class_core.php
  • ./includes/config.php
  • ./includes/functions.php
  • ./includes/class_hook.php
  • ./includes/modsystem_functions.php
  • ./includes/class_bbcode_alt.php
  • ./includes/class_bbcode.php
  • ./includes/functions_bigthree.php 

Hooks Called:
  • init_startup
  • init_startup_session_setup_start
  • init_startup_session_setup_complete
  • cache_permissions
  • fetch_threadinfo_query
  • fetch_threadinfo
  • fetch_foruminfo
  • style_fetch
  • cache_templates
  • global_start
  • parse_templates
  • global_setup_complete
  • printthread_start
  • bbcode_fetch_tags
  • bbcode_create
  • bbcode_parse_start
  • bbcode_parse_complete_precache
  • bbcode_parse_complete
  • printthread_post
  • printthread_complete