Censorship case-sensitive


Wlad
03-01-2010, 01:18 PM
Hello!

Censorship is case-sensitive, i.e. the words "привет" and "Привет" are recognized as different. When I write the words in the Latin alphabet, e.g. "hello" and "Hello", everything works normally and both words are replaced with "******". How can I make censorship case-insensitive for Cyrillic letters?

PS: Sorry for my English, I am using a translator.

Marco van Herwaarden
03-01-2010, 02:16 PM
I am not 100% sure what you are asking. Do you want both of those Russian words to be treated the same, regardless of case?

Wlad
03-01-2010, 04:50 PM
Marco van Herwaarden, yes. If you write "Привет привет приВет ПРивет" and the censored word is "привет", we get "Привет ****** приВет ПРивет". It should be "****** ****** ****** ******".

Marco van Herwaarden
03-02-2010, 08:19 AM
I have not encountered this problem before, but my guess is that it is caused by the character set used for MySQL. If the character set doesn't know that both characters are the same letter, only in a different case, then it won't work.

kh99
03-02-2010, 12:36 PM
I'm basing this answer on only about 10 minutes of research, but it looks like the censored words are detected using the PHP "preg" functions (http://us2.php.net/manual/en/ref.pcre.php), which are built on the PCRE library (http://www.pcre.org/). Beyond that, I don't know how to tell you to fix it. It could have to do with how the server locale is set or how those libraries were built. (Some of the comments on this page might be helpful: http://us2.php.net/manual/en/function.setlocale.php)

One comment on the PHP manual site (http://us2.php.net/manual/en/function.preg-match.php) mentions changing the pattern string to force the use of UTF-8. I have no idea whether this would fix your problem, but it's something you could probably try easily:
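To make the suspected mechanism concrete, here is a minimal sketch in Python rather than PHP. It is only an analogy, on the assumption that the censor pattern is being matched against raw UTF-8 bytes: a byte-oriented regex folds case for ASCII letters only, while a character-oriented one folds Cyrillic too. In PHP the usual counterpart would be the `u` pattern modifier, but nothing in this thread confirms that vBulletin's censor code can take it. The variable names are made up for illustration.

```python
import re

text = "Привет привет приВет ПРивет"
word = "привет"

# Byte-oriented matching: pattern and subject are raw UTF-8 bytes, so
# IGNORECASE only folds ASCII letters. "П" and "п" are different byte
# sequences and are treated as unrelated.
raw_result = re.sub(
    re.escape(word.encode("utf-8")),
    b"******",
    text.encode("utf-8"),
    flags=re.IGNORECASE,
).decode("utf-8")

# Character-oriented matching: pattern and subject are Unicode strings, so
# IGNORECASE applies Unicode case rules and covers Cyrillic as well.
fixed_result = re.sub(re.escape(word), "******", text, flags=re.IGNORECASE)

print(raw_result)    # Привет ****** приВет ПРивет
print(fixed_result)  # ****** ****** ****** ******
```

The first output matches the symptom Wlad describes, which is at least consistent with the pattern being handled as bytes instead of as UTF-8 characters.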

I noticed that in order to deal with UTF-8 texts, without having to recompile php with the PCRE UTF-8 flag enabled, you can just add the following sequence at the start of your pattern: (*UTF8)

for instance : '#(*UTF8)[[:alnum:]]#' will return TRUE for '?' where '#[[:alnum:]]#' will return FALSE

I found this very useful tip after hours of research on the web, directly on the PCRE website: http://www.pcre.org/pcre.txt
There is much more information about UTF-8 support in the library there.

Hope that will help!

--
cedric
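The distinction that tip relies on (a character class that only recognises non-ASCII letters once the engine treats the text as Unicode) can be sketched in Python. This is not the PCRE `(*UTF8)` syntax itself. It is an analogy that uses Python's `\w` class in place of `[[:alnum:]]`, with an arbitrarily chosen Cyrillic letter as the test character.

```python
import re

letter = "п"  # arbitrary non-ASCII letter, chosen for illustration

# Bytes pattern: \w covers only ASCII word characters, so the two UTF-8
# bytes that encode the letter are not recognised.
byte_match = re.fullmatch(rb"\w+", letter.encode("utf-8"))

# String pattern: \w follows Unicode character properties, so the letter
# is recognised as a word character.
char_match = re.fullmatch(r"\w+", letter)

print(byte_match is not None)  # False
print(char_match is not None)  # True
```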


Doing a grep for $vbulletin->options['censorwords'] finds two files in the includes directory, functions.php and class_dm_user.php, so that's at least a place to start.
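For orientation when reading those files, here is a rough Python sketch of what a case-insensitive, Unicode-aware censor routine could look like. It is not vBulletin's code. The function name `censor`, the space-separated word list, and the one-mask-character-per-letter rule are all assumptions made for the example.

```python
import re

def censor(text, censor_words, mask_char="*"):
    """Mask every listed word in text, ignoring letter case (hypothetical helper)."""
    # Longest words first, so a short word cannot pre-empt a longer one
    # that starts with the same letters.
    words = sorted(censor_words.split(), key=len, reverse=True)
    if not words:
        return text
    # Escape each word so it is matched literally, then combine into one
    # alternation. Matching on str (not bytes) gives Unicode case folding.
    pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
    return pattern.sub(lambda m: mask_char * len(m.group(0)), text)

print(censor("Привет привет приВет ПРивет, Hello hello", "привет hello"))
# ****** ****** ****** ******, ***** *****
```

Whatever the PHP code actually does, the thing to check is the same: whether the pattern and the post text are treated as UTF-8 characters or as plain bytes when the case-insensitive match runs.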

Maybe someone else out there knows more about locales?