Fixing Curly Quotes & Em Dashes


SBlueman
03-22-2009, 03:32 PM
As many of you know, curly quotes and em dashes can frustrate you to no end. Some XML readers can't parse those characters, and you end up with a mess on your hands.

I recently found this online and was wondering whether there is a way to implement it in vBulletin:

http://www.snipe.net/2008/12/fixing-curly-quotes-and-em-dashes-in-php/

The curly quotes, or "smart quotes", generated by Microsoft Word and other applications can be a real headache to developers. If you've built an administration area for your content publishers, and the publishers frequently compose their posts in Word and then copy+paste into your form to publish to the web, you may run into the situation where the curly quotes are replaced by your browser's version of an unrecognized symbol, often a question mark. This can be particularly frustrating when Word-generated characters such as these curly quotes or em dashes break content-generated XML feeds, even after you've been careful enough to convert "normal" HTML special characters so that your XML would be valid. Fortunately, there is an easy workaround.


Rather than try to convince your publishers to stop using Word to compose their content, the easier (and more effective) solution will be to replace the curly quotes with "normal" quotes before the data is inserted into the database.

The function below will convert curly quotes and em dashes into standard quotes and dashes ("-"). If you've got a handful of classes or functions that you routinely use as part of your data scrubbing process (to clean data before it gets sent to the server), you may want to include this function in that group, so that you don't ever have to think about it again.

function convert_smart_quotes($string)
{
    $search = array(chr(145),
                    chr(146),
                    chr(147),
                    chr(148),
                    chr(151));

    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');

    return str_replace($search, $replace, $string);
}
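
One caveat worth noting: chr(145) through chr(151) are the Windows-1252 byte values for these characters, so the function above only matches text that arrives in that single-byte encoding. If the form submits UTF-8, each curly quote or dash is a three-byte sequence instead. Below is a minimal sketch of a UTF-8 counterpart. The name convert_smart_quotes_utf8 is made up for illustration and is not from the linked article, and the sketch assumes the input is valid UTF-8.

```php
<?php
// Hypothetical UTF-8 counterpart to convert_smart_quotes().
// Assumption: $string is valid UTF-8, not Windows-1252.
function convert_smart_quotes_utf8($string)
{
    $map = array(
        "\xE2\x80\x98" => "'",  // U+2018 left single quotation mark
        "\xE2\x80\x99" => "'",  // U+2019 right single quotation mark
        "\xE2\x80\x9C" => '"',  // U+201C left double quotation mark
        "\xE2\x80\x9D" => '"',  // U+201D right double quotation mark
        "\xE2\x80\x93" => '-',  // U+2013 en dash
        "\xE2\x80\x94" => '-',  // U+2014 em dash
    );

    // strtr() with an array swaps every key for its value in a single pass.
    return strtr($string, $map);
}

// Prints: "Hello" - it's me
echo convert_smart_quotes_utf8("\xE2\x80\x9CHello\xE2\x80\x9D \xE2\x80\x94 it\xE2\x80\x99s me"), "\n";
```

Which of the two variants applies depends on the character set your forum and database actually use, so check that before picking one.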

SBlueman
03-25-2009, 04:47 AM
Anyone?

SBlueman
03-26-2009, 05:56 AM
Hello? McFly?

SBlueman
03-28-2009, 03:05 AM
Seriously....anyone???

juanune2
04-30-2010, 09:51 PM
I hate replying to a year-old thread, but I wasn't even a vBulletin user a month ago. : )

I ran into the same problem, and... really can't believe that it exists. This is a glaring hole, and you really couldn't use vBulletin in a production environment (at least not as more than a toy) without fixing it. I chose vBulletin over other solutions because of some of the power that it offers wrt communities, so I won't go into a rant... as I still believe I made the right choice.

I'm honestly not sure why this exists. I've found the issue not just on smart quotes, but also on things like ….

I've done some investigation through the code and have determined that there isn't a good way of doing this without touching half of the code in the system, as there is no general-purpose text-cleaning function. Note that since they're high ASCII values, modifying these is something that should be done immediately upon ingestion, but there doesn't appear to be a good way of doing that. There are even too many ways to enter text from the client side, and no general text parser in the script (not that you should trust client parsing). It will also be language dependent.

That being said, there is still a bad way of doing it, and I've implemented a brute-force method. Note... I have only done cursory testing with this. I have found no issues as of yet, but please make sure that you do some sanity checks.

STEP 1: [REQUIRED!]
#1 on the list of things to do before using my hack is to move ALL attachments, images, profile images, and all other binary data out of the database and into the filesystem. I have no issues with this, since I'm one of those guys that doesn't believe that you should ever have this kind of data in there in the first place.

Here's info on how, why/why not to do that:
vbulletin docs for moving attachments (http://www.vbulletin.com/docs/html/attachment_storage_fs_to_db)
and
vbulletin docs for moving user pictures (http://www.vbulletin.com/docs/html/userpics_db2fs)

STEP 2
Create a patch file. My file is called functions_custom.php, and lives in a 'custom' directory off of my root. It is a bit verbose (I chose a format that could be easily read), but here is the full text:
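<?php
// Flattens or strips Windows-1252 "high ASCII" bytes.
// This is a byte-level replace, so it assumes single-byte input.
function convert_extraspecial_chars($string)
{
    // uncomment the following line to turn the functionality off.
    //return $string;

    $search = array();
    $replace = array();

    $search[] = chr(130);   // low single quote
    $replace[] = '\'';
    $search[] = chr(131);   // florin sign
    $replace[] = '';
    $search[] = chr(132);   // low double quote
    $replace[] = '';
    $search[] = chr(133);   // ellipsis
    $replace[] = '...';
    $search[] = chr(134);   // dagger
    $replace[] = '';
    $search[] = chr(135);   // double dagger
    $replace[] = '';
    $search[] = chr(136);   // circumflex
    $replace[] = '';
    $search[] = chr(137);   // per mille sign
    $replace[] = '';
    $search[] = chr(138);   // capital S with caron
    $replace[] = '';
    $search[] = chr(139);   // single left angle quote
    $replace[] = '';
    $search[] = chr(140);   // capital OE ligature
    $replace[] = '';
    $search[] = chr(174);   // registered sign
    $replace[] = '(r)';
    $search[] = chr(169);   // copyright sign (was chr(175), which is the macron)
    $replace[] = '(c)';

    // From chr(149) on, the posted replacements were shifted by one
    // position; they are realigned here with the Windows-1252 table.
    $search[] = chr(145);   // left single quote
    $replace[] = '\'';
    $search[] = chr(146);   // right single quote
    $replace[] = '\'';
    $search[] = chr(147);   // left double quote
    $replace[] = '"';
    $search[] = chr(148);   // right double quote
    $replace[] = '"';
    $search[] = chr(149);   // bullet
    $replace[] = '*';
    $search[] = chr(150);   // en dash
    $replace[] = '-';
    $search[] = chr(151);   // em dash
    $replace[] = '-';
    $search[] = chr(152);   // small tilde
    $replace[] = '';
    $search[] = chr(153);   // trademark sign
    $replace[] = 'tm';
    $search[] = chr(154);   // small s with caron
    $replace[] = '';
    $search[] = chr(155);   // single right angle quote
    $replace[] = '\'';
    $search[] = chr(156);   // small oe ligature
    $replace[] = '';

    return str_replace($search, $replace, $string);
}
?>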
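
Because this function works on raw byte values, it is only safe on single-byte (Windows-1252 style) input. In UTF-8 text, bytes in the 128 to 191 range occur inside multi-byte characters and would be mangled. One possible sanity check, sketched below, is to run the cleaner only when a string is not valid UTF-8. The helper names is_valid_utf8 and clean_legacy_text are hypothetical and not part of vBulletin, and the cleaner is passed in as a callable so that the sketch stands alone.

```php
<?php
// Hypothetical helper: true when $string is well-formed UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function is_valid_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

// Hypothetical wrapper: apply a byte-level cleaner only to strings that
// are NOT valid UTF-8, i.e. presumably in a single-byte encoding.
function clean_legacy_text($string, $cleaner)
{
    if (is_valid_utf8($string)) {
        return $string; // plain ASCII or UTF-8: leave untouched
    }
    return call_user_func($cleaner, $string);
}

// Stand-in cleaner for the demo; in practice this would be
// convert_extraspecial_chars() from functions_custom.php.
$demo_cleaner = function ($s) {
    return str_replace(array(chr(147), chr(148)), '"', $s);
};

$cp1252 = chr(147) . 'quoted' . chr(148);      // Windows-1252 bytes
$utf8   = "\xE2\x80\x9Cquoted\xE2\x80\x9D";    // same text in UTF-8

var_dump(clean_legacy_text($cp1252, $demo_cleaner) === '"quoted"'); // bool(true)
var_dump(clean_legacy_text($utf8, $demo_cleaner) === $utf8);        // bool(true)
```

This is a heuristic rather than a guarantee: a short Windows-1252 string can occasionally also be valid UTF-8, so it belongs in the "sanity check" category and does not replace knowing your forum's character set.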

STEP 3
Hook your patch into the only place that seems to be a culmination point for input parsing, which is escape_string() inside of class_core.php (around line 717):

function escape_string($string)
{
    // Custom patch: clean high-ASCII bytes before the string is escaped.
    // The relative path assumes the 'custom' directory from Step 2.
    require_once('../custom/functions_custom.php');
    $string = convert_extraspecial_chars($string);

    // Existing escaping logic follows.
    if ($this->functions['escape_string'] == $this->functions['real_escape_string'])
    {
        return $this->functions['escape_string']($string, $this->connection_master);
    }
    else
    {
        return $this->functions['escape_string']($string);
    }
}


What this does is force an extra cleaning pass for all info that passes into the DB, stripping certain high-ascii values. If you tried to skip step 1, you'll wind up corrupting 99% of binary files uploaded to the system... so don't do that.
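
To see why Step 1 matters, here is a small self-contained sketch of what a byte-level replace does to binary data. It uses the standard 8-byte PNG file signature, whose first byte is 0x89, i.e. chr(137), one of the values the patch replaces with an empty string. The single-entry map is a cut-down stand-in for the full function.

```php
<?php
// The standard PNG file signature: 89 50 4E 47 0D 0A 1A 0A.
$png_signature = "\x89PNG\r\n\x1A\n";

// One mapping in the style of convert_extraspecial_chars():
// chr(137) is replaced with an empty string.
$cleaned = str_replace(array(chr(137)), array(''), $png_signature);

echo strlen($png_signature), " bytes before\n";  // 8 bytes before
echo strlen($cleaned), " bytes after\n";         // 7 bytes after
echo ($cleaned === $png_signature) ? "intact\n" : "corrupted\n"; // corrupted
```

With the leading byte gone, the data no longer starts with a valid PNG signature, and the same kind of damage can happen anywhere one of the mapped byte values shows up in a file.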

I can't comment on general usage, since I've only worked through this on one pre-production installation, but it is working nicely here. As such, put this through a test pass before using it. If it works, great... put some comments in here. If it doesn't... maybe I can offer some assistance.

-- j

--------------- Added 1272684214 at 1272684214 ---------------

Ok, after dealing with this a bit more...

I changed the map file so that the mappings strip out high ascii characters instead of putting them through an HTML entities setup. Although... the real problem lies in how the database calls are structured. Again, the fix would mean changing every call in the system, and I'm sure that nobody is keen to do that.
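
For anyone weighing the two approaches, the difference can be sketched roughly like this. Both maps are illustrative fragments rather than the actual map file, and the function names to_entities and to_ascii are made up for the example.

```php
<?php
// Illustrative fragment: map Windows-1252 bytes to HTML entities.
function to_entities($string)
{
    return strtr($string, array(
        chr(145) => '&lsquo;',
        chr(146) => '&rsquo;',
        chr(147) => '&ldquo;',
        chr(148) => '&rdquo;',
        chr(151) => '&mdash;',
    ));
}

// Illustrative fragment: flatten the same bytes to plain ASCII instead.
function to_ascii($string)
{
    return strtr($string, array(
        chr(145) => "'",
        chr(146) => "'",
        chr(147) => '"',
        chr(148) => '"',
        chr(151) => '-',
    ));
}

$input = chr(147) . 'Hi' . chr(148);

echo to_entities($input), "\n"; // &ldquo;Hi&rdquo;
echo to_ascii($input), "\n";    // "Hi"
```

The entity version keeps the typography, but the stored text is no longer plain: if it is later run through htmlspecialchars(), the ampersands get escaped again and readers see the literal entity names. Flattening to ASCII loses the fancy characters but avoids that problem.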