Quote:
Originally Posted by AndrewD
Thanks, Felix. Indeed the problem is/was the word boundary. The drawback with removing the word boundary markers is that you end up highlighting substrings in the results which the search itself did not match.
For example, suppose you have a string "happily merrily sadly happilymerrilysadly" and you do a search for merrily.
This should highlight as "happily [merrily] sadly happilymerrilysadly" (highlighting shown here in brackets),
and it does with the word boundary flags in the regex.
But without them, it highlights as "happily [merrily] sadly happily[merrily]sadly".
So we need to solve the word boundary problem in utf8.
|
Well then, I am happy.

Then it is actually a feature.
If you search for "intern" in Google, words like international, internal or internship are highlighted in the title and the description!
I was going to modify the search from "word" to "*word*" anyway, because if I search for "luxury" and the only matching entry has the word "luxuryhotels" in its description, I would get no results; it would not show up. In that case at least the highlighting would already be done.
---------------------
On the other hand, using ldm as is, it is also not a major drawback:
if you are looking for merrily, it will only show you results where the word "merrily" stands alone, so you do have the correct results. And if a result also contains an extra "sadlymerrilysadly", then that simply gets highlighted too, which I think is a feature!
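
To make the two behaviours concrete, here is a rough sketch. It is written in Python rather than PHP and is not the ldm code; the helper name highlight and the bracket markers are made up for illustration only.

```python
import re

def highlight(text, term, whole_words=True):
    """Wrap every match of term in [brackets].

    Hypothetical helper for illustration; not the routine used by ldm.
    """
    pattern = re.escape(term)
    if whole_words:
        # Word boundary markers: only standalone occurrences match.
        pattern = r"\b" + pattern + r"\b"
    return re.sub(pattern, lambda m: "[" + m.group(0) + "]", text,
                  flags=re.IGNORECASE)

text = "happily merrily sadly happilymerrilysadly"
with_boundaries = highlight(text, "merrily", whole_words=True)
without_boundaries = highlight(text, "merrily", whole_words=False)

print(with_boundaries)     # happily [merrily] sadly happilymerrilysadly
print(without_boundaries)  # happily [merrily] sadly happily[merrily]sadly
```

With boundaries only the standalone word is marked; without them the embedded occurrence is marked as well, which is the behaviour argued for above.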
---------------------
So if that is the only drawback, I'm sticking with that solution, especially as PHP 6 is going to have full Unicode support, and I am ready to bet that this problem will be solved in PHP 6!
But at least for the moment, adding the /u modifier (making it /iu) to the regex will help for languages like German, French or Spanish, as the highlighting will then work as you expect.
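
A rough illustration of why the Unicode mode matters, again as a Python sketch and not PHP: matching on raw UTF-8 bytes stands in for the regex without /u, and matching on text strings stands in for the regex with /u. The sample sentence is my own assumption.

```python
import re

sentence = "das ist über alles"
term = "über"

# Byte-oriented matching (roughly comparable to leaving out /u): the two
# UTF-8 bytes of "ü" are not word characters, so there is no word boundary
# in front of them and the whole-word search finds nothing.
byte_match = re.search(
    rb"\b" + re.escape(term.encode("utf-8")) + rb"\b",
    sentence.encode("utf-8"),
    flags=re.IGNORECASE,
)

# Unicode-aware matching (roughly comparable to /iu): "ü" counts as a
# letter, so the boundary before it is recognised and the word is found.
unicode_match = re.search(
    r"\b" + re.escape(term) + r"\b",
    sentence,
    flags=re.IGNORECASE,
)

print(byte_match)              # None
print(unicode_match.group(0))  # über
```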
Felix
PS: I just saw your edit. Testing now!
[EDIT]
I just tested your routine. It works fine with the description (but not with keywords), hmmm.
BUT with Chinese there is another problem. I did some reading (I do not understand Chinese).
I was trying to extract content to use as a description; that's how I stumbled onto this article.
It says:
Quote:
Chinese sentences are written with no special delimiters such as space to indicate word boundaries. Existing Chinese NLP systems therefore employ preprocessors to segment sentences into words.
|
source:
http://portal.acm.org/citation.cfm?id=981621
If this is true, I think the "no boundary" version will be the easiest solution for Chinese for the moment.
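
A small sketch of what that means in practice (Python, not PHP; the sample sentence and the bracket markers are my own assumptions, and since I do not read Chinese the example is only meant to show the mechanics):

```python
import re

# A sentence written without spaces between the words (assumed sample).
text = "我喜欢北京烤鸭"
term = "北京"

# With word boundary markers in Unicode mode: every character around the
# term counts as a word character, so no boundary exists and nothing matches.
bounded = re.search(r"\b" + re.escape(term) + r"\b", text)

# The "no boundary" version: a plain substring match still finds the term.
unbounded = re.sub(re.escape(term), lambda m: "[" + m.group(0) + "]", text)

print(bounded)    # None
print(unbounded)  # 我喜欢[北京]烤鸭
```

So without a segmentation step that splits the sentence into words first, the boundary-free pattern is the only one of the two that highlights anything here.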