Converting Special Chars from HTML to UTF-8 ascii standard?


Kaelon
09-06-2007, 07:52 PM
Hey there,

I'm using the AddonChat Integration Script (https://vborg.vbsupport.ru/showthread.php?t=131683) and have been working with Chris Duerr, the author, to try and solve this problem: users whose usernames contain special characters (such as accented letters, as in á é í ó ú) are getting an invalid username/password notice. This is because vBulletin stores these special characters as their HTML entity equivalents.

How can we convert the HTML escape characters to UTF-8 standard ascii characters?

Here is the code cited from the integration script:

<?php
header("Content-type: text/plain; charset=iso-8859-1");
error_reporting(E_ALL & ~E_NOTICE);
define('NO_REGISTER_GLOBALS', 1);
define('SESSION_BYPASS', 1);
define('LOCATION_BYPASS', 1);
//define('DIE_QUIETLY', 1);

/*
We lie a little here to let us get through when
forum read privileges are disabled for non-registered
users.
*/
define('THIS_SCRIPT', 'login');
$_REQUEST['do'] = 'register';
require_once('./global.php');
require_once('./chat_global.php');

$username = $_REQUEST['username'];
$password = $_REQUEST['password'];

/*
Uncomment the following to support non-ASCII UTF-8 characters
Requires PHP Multibyte String (mbstring) Extension
*/
$username = mb_convert_encoding($username, "HTML-ENTITIES", "UTF-8");
$password = mb_convert_encoding($password, "HTML-ENTITIES", "UTF-8");


if(!$SIGMACHAT_VB_AUTHENTICATE) die("DISABLED");

# Fetch User Info from Database..
$uid = 0;
if ($userinfo = $db->query_first('SELECT userid, usergroupid, membergroupids, password, salt FROM ' . TABLE_PREFIX . 'user WHERE username = "' . addslashes(htmlspecialchars_uni($username)) . '"'))
{
    # Invalid Password
    if (($userinfo['password'] != $password) && ($userinfo['password'] != md5(md5($password) . $userinfo['salt'])))
        $auth = 0;
    else
    {
        # Array keys are quoted so PHP does not treat them as undefined constants
        $usergroups = explode(',', $userinfo['membergroupids']);
        $usergroups[] = $userinfo['usergroupid'];

        $auth = 0;
        foreach($usergroups as $ug)
        {
            if( ($auth < 3) && (in_array($ug, $SIGMACHAT_AUTH_GRANTACCESS)) ) $auth = 3;
            if( ($auth < 2) && (in_array($ug, $SIGMACHAT_AUTH_ADMINACCESS)) ) $auth = 2;
            if( ($auth < 1) && (in_array($ug, $SIGMACHAT_AUTH_ACCESS)) ) $auth = 1;
            if(in_array($ug, $SIGMACHAT_AUTH_NOACCESS)) { $auth = 0; break; }
        }
        $uid = $userinfo['userid'];
    }
}
else
    $auth = $SIGMACHAT_AUTH_GUEST;


$result_string = "SCRAS^1.1\nAUTH^$auth\nUID^$uid\n";

if($SIGMACHAT_ENABLE_LINK_PROFILE) $result_string .= "SITE_LINK^Profile^$SIGMACHAT_FORUM_URL/chat_func_profile.php\n";
if($SIGMACHAT_ENABLE_LINK_ADDBUDDY) $result_string .= "SITE_LINK^Add Buddy^$SIGMACHAT_FORUM_URL/chat_func_addbuddy.php\n";
if($SIGMACHAT_ENABLE_LINK_PM) $result_string .= "SITE_LINK^Prv. Message^$SIGMACHAT_FORUM_URL/chat_func_pm.php\n";
if($SIGMACHAT_ENABLE_LINK_EMAIL) $result_string .= "SITE_LINK^eMail^$SIGMACHAT_FORUM_URL/chat_func_email.php\n";
if($SIGMACHAT_ENABLE_LINK_FINDPOSTS) $result_string .= "SITE_LINK^Find Posts^$SIGMACHAT_FORUM_URL/chat_func_findposts.php\n";
if($SIGMACHAT_ENABLE_LINK_FORUM_IGNORE) $result_string .= "SITE_LINK^Forum Ignore^$SIGMACHAT_FORUM_URL/chat_func_ignore.php\n";

print($result_string);

?>

Update -- I've tried using html_entity_decode by calling as follows:

$username = html_entity_decode($username);
$password = html_entity_decode($password);

... where the "uncomment the following" comment is indicated in the above code. That didn't work, tragically.

Paul M
09-06-2007, 08:22 PM
There is a function in vb called unhtmlspecialchars()

From the documentation:


Returns a string where HTML entities have been converted back to their original characters

string unhtmlspecialchars (string $text, [boolean $doUniCode = false])

string $text: String to be parsed

boolean $doUniCode: Convert unicode characters back from HTML entities?

Kaelon
09-06-2007, 08:50 PM
Thanks, Paul! However, that didn't seem to work. I added:

$username = unhtmlspecialchars($username);
$password = unhtmlspecialchars($password);

... after the earlier mb_convert_encoding lines, and I was still getting invalid returns from the system. Judging by the code above, is there a more sensible place to call unhtmlspecialchars() so that the login validates? Thanks!

Kaelon
09-09-2007, 04:21 PM
Latest information from Chris Duerr, the original hack author:

I'm not familiar with that command -- but it almost seems like you'd want to do the reverse; that is convert the special chars to their HTML representation. Sometimes function names can be confusing though, so you may have the right function.

Do you know the usage of the command? Ideally it would be a drop-in replacement for the mb_convert_encoding commands; it'll be one of the first commands you run in the script.

What we typically do when debugging this sort of thing is to write the output data to a text file (using php file commands within the authentication script), as there is no easy way to simply echo the information to the console when using special characters. It helps to first print the raw data we send, then the data as you've converted it, and finally the raw data stored in the database, so you can compare them and gauge your progress.

Accordingly, is the opposite of unhtmlspecialchars() just htmlspecialchars()?
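
The file-based debugging Chris describes could look roughly like the sketch below. The helper name debug_log_value() and the log file location are hypothetical and not part of the AddonChat script. Logging a hex dump next to each value makes it possible to compare the bytes even when the characters themselves display as "?".

```php
<?php
// Hypothetical helper (not part of the AddonChat script): append a labelled
// value to a log file as text plus a hex dump, so values can be compared
// byte by byte even when the characters do not display correctly.
function debug_log_value($label, $value, $file)
{
    $line = sprintf("%s: %s [hex: %s]\n", $label, $value, bin2hex($value));
    file_put_contents($file, $line, FILE_APPEND);
    return $line;
}

// Assumed log location; pick any path the web server can write to.
$logfile = sys_get_temp_dir() . '/chat_auth_debug.log';

// Example value: "José" as UTF-8 bytes. In the authentication script you would
// log the raw $_REQUEST value, the converted value, and the username column
// read back from the database, each with its own label.
$raw = "Jos\xC3\xA9";
debug_log_value('raw', $raw, $logfile);
```

Comparing the three hex dumps shows directly whether the converted username matches the stored one.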

Paul M
09-09-2007, 06:03 PM
I didn't really read your code, you asked about decoding, which was what I answered.

Looking at your code, then yes, you need to do the opposite: you want to encode your username to match what vb stores. The vb function is htmlspecialchars_uni(), but I believe vb does more than just that.
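
For readers without the vBulletin source to hand, here is a rough stand-in for the kind of escaping htmlspecialchars_uni() performs. This is an approximation under stated assumptions, not vBulletin's actual code: the name approx_htmlspecialchars_uni() is hypothetical, and it assumes that existing numeric entities such as &#8364; are left untouched while a bare & and the characters <, > and " are escaped.

```php
<?php
// Hypothetical approximation, NOT vBulletin's actual source.
// Assumption: numeric entities (&#NNN;) already in the text are preserved.
function approx_htmlspecialchars_uni($text)
{
    // Escape "&" unless it already starts a numeric entity like &#8364;
    $text = preg_replace('/&(?!#[0-9]+;)/', '&amp;', $text);

    // Escape the remaining HTML special characters
    return str_replace(
        array('<', '>', '"'),
        array('&lt;', '&gt;', '&quot;'),
        $text
    );
}

echo approx_htmlspecialchars_uni('Tom & <Jerry> &#8364;'), "\n";
```

If the assumption holds, a username that already contains numeric entities can be passed through this step without being double-escaped.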

Kaelon
09-12-2007, 02:39 PM
Thanks, Paul. I gave that a shot, but strangely, still no luck. Specifically, I used:

$username = htmlspecialchars_uni($username);
$password = htmlspecialchars_uni($password);

... and I still got invalid returns from the system. Then looking further, I also saw that the chat_auth.php code provided by Chris Duerr had already apparently done this analysis:


# Fetch User Info from Database..
$uid = 0;
if ($userinfo = $db->query_first('SELECT userid, usergroupid, membergroupids, password, salt FROM ' . TABLE_PREFIX . 'user WHERE username = "' . addslashes(htmlspecialchars_uni($username)) . '"'))
{
    # Invalid Password
    if (($userinfo['password'] != $password) && ($userinfo['password'] != md5(md5($password) . $userinfo['salt'])))
        $auth = 0;
    else
    ...

Paul M
09-12-2007, 05:32 PM
You need to look in the user datamanager to see what other conversions vb does.

Kaelon
09-12-2007, 06:48 PM
Sounds good. Where can I find the user datamanager?

Paul M
09-12-2007, 07:52 PM
class_dm_user.php in the includes folder.

Grim77
05-03-2008, 06:55 AM
Kaelon -- Just curious if we ever found a solution to this? I'm working on the 3.7 mod now, and would like to find a solution that doesn't require a non-standard php library.

Kaelon
05-06-2008, 03:10 PM
Hi Grim77,

Unfortunately, no. Any of my users that have special characters in their usernames (such as accents, which are very common in Romance languages such as Spanish and French) have never been able to log in to our chat room properly. My recommendation would be to definitely allow special characters in the future.

Let me know how your progress goes with regards to this.

Thanks,
Juan

Grim77
05-06-2008, 04:40 PM
Ok, Juan -- We're working on the next update for v3.7 now. I'll look into this and see what we can do. :)

Kaelon
05-06-2008, 04:49 PM
Great, thanks, Chris!

Grim77
05-10-2008, 12:20 AM
The release candidate is now online for v3.7 integration. You can get it from http://forums.addoninteractive.com/showthread.php?t=3915

If you prefer to stick with 3.5/3.6, this is the code I've found to work. Admittedly I've only tested it on v3.7, though I don't think the way usernames are stored in the database has changed.

mb_convert_encoding almost does the trick, but not quite. I found the following code posted at php.net and modified it so that HTML character codes are used only for UTF-8 characters outside the 8-bit range. It also allows for special characters (like '<'), though some usernames with these special characters aren't permitted by the AddonChat chat software.

I've tested it using various English, Spanish and Arabic characters, and it seems to be working. Again though, if you're running v3.7 -- just download the release candidate and let me know if you run into any problems :)


/*
UTF-8 to Numeric HTML Entity Conversion
Credit to: http://us3.php.net/manual/en/function.utf8-decode.php#75941
** Modified to only return HTML entities for characters out of 8 bit ASCII range.
** Modified to use htmlspecialchars_uni() function.
*/
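function utf8_to_html ($data)
{
    // Convert each multi-byte UTF-8 sequence, then escape the remaining HTML special characters.
    // preg_replace_callback() replaces the /e modifier, which newer PHP versions no longer support.
    return htmlspecialchars_uni(preg_replace_callback('/[\xC0-\xF7][\x80-\xBF]+/', '_utf8_to_html', $data));
}

function _utf8_to_html ($match)
{
    $data = $match[0];

    // Strip the length marker bits from the lead byte, then add up 6 bits per byte, last byte first.
    $ret = 0;
    foreach((str_split(strrev(chr((ord($data[0]) % 252 % 248 % 240 % 224 % 192) + 128) . substr($data, 1)))) as $k => $v)
        $ret += (ord($v) % 128) * pow(64, $k);

    // Code points below 256 fit in a single byte; everything else becomes a numeric entity.
    if($ret < 256)
        return chr($ret);

    return "&#$ret;";
}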
To use, insert the above code at the end of your authentication script, then find the following code:

if ($userinfo = $db->query_first('SELECT userid, usergroupid, membergroupids, password, salt FROM ' . TABLE_PREFIX . 'user WHERE username = "' . addslashes(htmlspecialchars_uni($username)) . '"'))

and replace it with:

if ($userinfo = $db->query_first('SELECT userid, usergroupid, membergroupids, password, salt FROM ' . TABLE_PREFIX . 'user WHERE username = "' . $db->escape_string(utf8_to_html($username)) . '"'))
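
As a worked example of the arithmetic inside _utf8_to_html(), the standalone sketch below decodes a UTF-8 sequence to its code point with bit masks instead of the modulo chain. The function names utf8_sequence_to_codepoint() and codepoint_to_stored_form() are hypothetical, and the sketch assumes well-formed input. It is meant only to illustrate the idea, not to replace the code above.

```php
<?php
// Hypothetical illustration: decode one well-formed UTF-8 sequence to a code point.
function utf8_sequence_to_codepoint($seq)
{
    $lead = ord($seq[0]);

    // Keep only the payload bits of the lead byte (3, 4 or 5 bits).
    if ($lead >= 0xF0) {
        $cp = $lead & 0x07;   // 4-byte sequence
    } elseif ($lead >= 0xE0) {
        $cp = $lead & 0x0F;   // 3-byte sequence
    } else {
        $cp = $lead & 0x1F;   // 2-byte sequence
    }

    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < strlen($seq); $i++) {
        $cp = ($cp << 6) | (ord($seq[$i]) & 0x3F);
    }
    return $cp;
}

// Same rule as in the posted code: below 256 a single byte, otherwise a numeric entity.
function codepoint_to_stored_form($cp)
{
    return $cp < 256 ? chr($cp) : "&#$cp;";
}

echo utf8_sequence_to_codepoint("\xC3\xA9"), "\n";      // é (C3 A9)    -> 233
echo utf8_sequence_to_codepoint("\xE2\x82\xAC"), "\n";  // € (E2 82 AC) -> 8364
echo codepoint_to_stored_form(8364), "\n";              // &#8364;
```

So "é" (233) comes out as the single byte 0xE9, while "€" (8364) becomes the numeric entity &#8364;, which is the split between 8-bit characters and everything else that the modified php.net code makes.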