The Archive of Official vBulletin Modifications Site. It is not a VB3 engine, just a parsed copy!
#1
Understanding post content unicode
Hi-
I am trying to include foreign-language snippets in post content when posts are created by a script. I have read about the foreign-language downloads that you can install in your forum, and a few posts on unicode processing in vB, but I haven't found a way of doing this yet, hence this post requesting guidance. I haven't altered the unicode settings for my forum, mostly because my testing showed that they had no effect on the post content, though I could have missed something along the way.
This is a summary of what I tested. To begin, I simply added a new post interactively via the standard editor on my test forum. I cut and pasted the text, which is Cyrillic BTW, directly into the message and pressed the submit button. The text rendered correctly:
катушки
and upon viewing the source I saw the standard HTML entities for the unicode characters that correspond to the entered text:
&#1082;&#1072;&#1090;&#1091;&#1096;&#1082;&#1080;
So the question I have is: how do I replicate this via a script? The data that my script receives has these characters represented in their unicode \u0XXX form (at least I think that's the standard form, though I seem to recall the %uXXX form too), which is easily convertible to the HTML equivalent. However, when my script does that and then submits the post (via a post DM object), all I see are the above HTML entities in the textual content of that post. And of course the same result occurred when leaving them as \u0XXX.
So I dug around in the code. I tried applying html_entity_decode to the body of the post prior to submitting it, but that didn't have any effect. I dug further and found a couple of interesting items that I was going to attempt next, specifically:
unhtmlspecialchars (vB function)
htmlspecialchars_decode (php function)
unhtmlspecialchars would need its second parameter set to true, or else it won't decode unicode entities, and I only saw this done in a couple of places within all of the vB code.
Anyway, it was about midnight when I found these, so I haven't tried them yet, mostly because I'm not sure they will work or whether this is even the correct approach to the problem. So in the end I'd just like to ask: has anyone else dealt with this issue already and, if so, can you describe how you solved it? I conducted a few searches on vb.org/forum but didn't find anything.
Thanks!
Jerry
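On the html_entity_decode attempt described above: one possible explanation for it having no visible effect, offered as an assumption rather than a diagnosis, is the target charset. In PHP versions before 5.4 the function defaulted to ISO-8859-1, which cannot represent Cyrillic, so as far as I know such entities come back unchanged. A minimal sketch with the charset passed explicitly (the sample string is the numeric-entity form of катушки):

```php
<?php
// Numeric character references for the Cyrillic word used in the test post.
$entities = '&#1082;&#1072;&#1090;&#1091;&#1096;&#1082;&#1080;';

// Name the target charset explicitly so the decoder can emit Cyrillic;
// a single-byte target such as ISO-8859-1 has no slot for these characters.
$utf8 = html_entity_decode($entities, ENT_QUOTES, 'UTF-8');

echo $utf8, "\n"; // prints катушки
```

Whether decoded UTF-8 or the entity form is the right thing to hand to the post DM depends on the forum's charset, which the thread does not state.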
#2
I haven't dealt with this before, but doing a little searching, I found this page: http://stackoverflow.com/questions/2...8-encoded-char
Adapting it a little, I found that if, for example, your message is saved in $message and looks like this: "\u043a\u0430\u0442\u0443\u0448\u043a\u0438", and you do something like this:
Code:
// Convert one matched \uXXXX escape into its HTML entity.
function replace_unicode_escape_sequence($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'HTML-ENTITIES', 'UCS-2BE');
}
$message = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', $message);
before sending $message to the dm object, then you will get a post with cyrillic chars.
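For readers on newer PHP versions: the 'HTML-ENTITIES' pseudo-encoding of mb_convert_encoding is deprecated as of PHP 8.2. The same conversion can be sketched without mbstring by turning the four hex digits into a decimal entity with hexdec(). This is a swapped-in technique, not the code from the post above, and the helper name unicode_escapes_to_entities is hypothetical:

```php
<?php
// Hypothetical helper name; not part of vBulletin or of the post above.
function unicode_escapes_to_entities($text)
{
    // Match a literal backslash, "u", and exactly four hex digits.
    return preg_replace_callback(
        '/\\\\u([0-9a-f]{4})/i',
        function ($match) {
            // hexdec() turns "043a" into 1082, giving "&#1082;".
            return '&#' . hexdec($match[1]) . ';';
        },
        $text
    );
}

// Single quotes keep the backslashes literal in the sample input.
$message = '\u043a\u0430\u0442\u0443\u0448\u043a\u0438';
echo unicode_escapes_to_entities($message), "\n";
// prints &#1082;&#1072;&#1090;&#1091;&#1096;&#1082;&#1080;
```

This sketch handles each \uXXXX escape on its own, so characters outside the Basic Multilingual Plane, which arrive as a surrogate pair of two escapes, would need extra handling.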
#3
Hi again kh99-
Thank you for the input. Websites like stackoverflow and devshed can be helpful sometimes, yes? I found this late last night too, but unfortunately it doesn't work. With this, the output (page rendering) is:
u043Au0430u0442u0443u0448u043Au0438
So I'm back at the drawing board. I'll post if I find a solution. If you have any additional ideas, feel free to share, I'd appreciate it.
Thanks!
Jerry
ps-I could have taken a left where I should have taken a right, so I am verifying at the moment...
pps-Back again: I found the place where I zigged where I should have zagged. [scratching the top of my head] Still working...
#4
I tested it by making a plugin using newpost_process and this code:
Code:
function replace_unicode_escape_sequence($match) {
    return mb_convert_encoding(pack('H*', $match[1]), 'HTML-ENTITIES', 'UCS-2BE');
}
// Hard-coded test input: every new post's message is replaced by the converted string.
$post['message'] = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', 'replace_unicode_escape_sequence', "\u043a\u0430\u0442\u0443\u0448\u043a\u0438");
BTW, it's a test system. Don't put it on a live forum or every post will be катушки.
#5
Hmmmm...ok. Well, I only have this code in my script, which is a module in the overall migration utility from Lefora to vB, so it won't be accessible in the forum, no.
I'm still puzzled because I have been seeing this error:
HTML Code:
<b>Fatal error</b>: Cannot redeclare replace_unicode_escape_sequence() (previously declared in /home/russia/public_html/testvb/AddNewThreadPost.php:250) in <b>/home/russia/public_html/testvb/AddNewThreadPost.php</b> on line <b>250</b><br />
Ack. Back at it today. Thanks
#6
Hmm...yeah, you can't redefine the function over and over, but defining it once at the beginning (outside any loop) should work, or doing something like:
Code:
// Braces are required here: PHP only accepts a conditional function declaration inside a block.
if (!function_exists("replace_unicode_escape_sequence")) {
    function replace_unicode_escape_sequence($match) {
        return mb_convert_encoding(pack('H*', $match[1]), 'HTML-ENTITIES', 'UCS-2BE');
    }
}
Maybe HTML-ENTITIES is new to PHP5? Doesn't look like it - or at least it's not listed in the changes.
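Another way to sidestep the "Cannot redeclare" error, sketched here under the assumption of PHP 5.3 or later, is to hold the callback in a variable as an anonymous function. Re-running the assignment just reassigns the variable instead of declaring a named function twice. The sample data and variable names are hypothetical, and the conversion uses hexdec() rather than the mb_convert_encoding call from the posts above:

```php
<?php
// Assigned once, before the loop; running this line again only reassigns the variable.
$toEntity = function ($match) {
    // Swapped-in conversion: hexdec() instead of the thread's mb_convert_encoding() call.
    return '&#' . hexdec($match[1]) . ';';
};

// Hypothetical sample bodies standing in for posts read by a migration script.
$bodies = array('\u043a\u0430', 'plain text', '\u0442');

$converted = array();
foreach ($bodies as $body) {
    // The same closure is reused for every post body.
    $converted[] = preg_replace_callback('/\\\\u([0-9a-f]{4})/i', $toEntity, $body);
}

print_r($converted);
```

Each converted body would then be passed to the post DM object in place of the raw text.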
#7
kh99-
Much thanks again. I'm not sure if HTML-ENTITIES is new to PHP5 or not, but I saw it in the list of supported encodings of the PHP version (5-something) that my webhosting service provides, so I knew I was good-to-go in that regard. Anyway, I found the error of my ways and now this little trick worked. Hopefully this can help someone in the future.
Jerry