Removing HTML Code From Posts
ConqSoft
01-12-2004, 09:32 PM
I converted my site from IkonBoard a while back, and all the old posts are full of HTML formatting code that was imported. I have HTML turned off on VB, so of course, the messages show up with all this ugly HTML code in them. Is there an easy way I can strip all that out?
Thanks.
Example Thread: http://www.fireblades.org/forums/showthread.php?t=6708
NTLDR
01-12-2004, 09:38 PM
There are a few options for this, I guess. Assuming the HTML is pretty much the same in a lot of posts (for example, the HTML for a quote box), you could use vB replacements to just hide it.
Alternatively, you'd need a script to run through and grab each post, either strip all HTML or try to convert it to vBcode, and then update the database. Going by your 97k posts, this would take a while to do, as you'd need around the same number of queries to perform the operation.
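As a rough illustration of the "convert it to vBcode" option, here is a minimal sketch. The function name html_to_vbcode() and the tag list are assumptions made up for this example, not part of vBulletin; real Ikonboard markup would likely need more cases (quote boxes, fonts, links).

```php
<?php
// Hypothetical helper: maps a few simple HTML tags to vB code, then strips the rest.
function html_to_vbcode(string $text): string
{
    // Paired tags with a direct vB code equivalent (an assumed, partial list).
    $map = array('b' => 'b', 'strong' => 'b', 'i' => 'i', 'em' => 'i', 'u' => 'u');
    foreach ($map as $html => $vb) {
        // Opening tag, with or without attributes; \b keeps <b from matching <br>.
        $text = preg_replace('#<' . $html . '\b[^>]*>#i', '[' . $vb . ']', $text);
        // Closing tag.
        $text = preg_replace('#</' . $html . '\s*>#i', '[/' . $vb . ']', $text);
    }
    // Line breaks become newlines.
    $text = preg_replace('#<br\s*/?>#i', "\n", $text);
    // Anything left has no mapping here and is simply removed.
    return strip_tags($text);
}

echo html_to_vbcode('<b>Hi</b> <font color="red">there</font><br>bye');
```

Tags without a mapping (the font tag above) lose their markup but keep their text, so nothing readable is thrown away.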
Xenon
01-12-2004, 09:52 PM
or you can try this way:
open includes/functions_bbcodeparse.php
find:
// ###################### Start bbcodeparse2 #######################
function parse_bbcode2($bbcode, $dohtml, $dobbimagecode, $dosmilies, $dobbcode, $iswysiwyg = 0, $donl2br = 1)
{
// parses text for vB code, smilies and censoring
global $DB_site, $vboptions, $bbuserinfo, $templatecache, $smiliecache;
global $html_allowed;
and below add:
$bbcode = strip_tags($bbcode);
ConqSoft
01-12-2004, 10:09 PM
Thanks.
Just tried that, but nothing changed. :(
Xenon
01-12-2004, 10:12 PM
hmm, that should work normally...
well, I don't believe it will help, but maybe try this:
$bbcode = strip_tags(unhtmlspecialchars($bbcode));
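The difference between the two suggested lines only matters if the stored post text is entity-encoded. A small sketch of that case follows; the sample string is made up, and PHP's built-in htmlspecialchars_decode() is used as a stand-in for vBulletin's unhtmlspecialchars(), which is assumed to behave similarly for these entities.

```php
<?php
// A post whose markup was stored entity-encoded (assumed example data).
$stored = '&lt;b&gt;Hello&lt;/b&gt; world';

// strip_tags() alone sees no real tags here, so nothing is removed.
$plain = strip_tags($stored);

// Decoding first turns the entities back into tags that strip_tags() can remove.
// htmlspecialchars_decode() stands in for vBulletin's unhtmlspecialchars().
$decoded = strip_tags(htmlspecialchars_decode($stored));

echo $plain, "\n";
echo $decoded, "\n";
```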
ConqSoft
01-12-2004, 10:14 PM
Nope. Same.
Xenon
01-12-2004, 10:16 PM
are you sure you have not enabled post caching?
if you have it enabled, you won't see fresh parsed posts ;)
ConqSoft
01-12-2004, 10:18 PM
Ooops. :o
Yep, post caching is enabled.
Andreas
01-12-2004, 10:23 PM
I'd close the board and run an update-script to strip all HTML.
Seems to be the better solution to do this one-time and not "on the fly" all the time.
ConqSoft
01-12-2004, 10:26 PM
Looks like it worked for the most part. Some non-breaking spaces remain, but it's much better than before.
Thanks.
Example: http://www.fireblades.org/forums/showthread.php?t=33
ConqSoft
01-12-2004, 10:26 PM
Running a one-time update script sounds good to me. So where's that update script, buddy? ;)
Andreas
01-12-2004, 11:02 PM
<?php
error_reporting(E_ALL & ~E_NOTICE);
require_once('./global.php');

echo "Stripping HTML from all posts, please stand by ...<br>";

// Lock the post table under its prefixed name; later queries use the same name
$DB_site->query("LOCK TABLES " . TABLE_PREFIX . "post WRITE");
$posts = $DB_site->query("SELECT * FROM " . TABLE_PREFIX . "post");
$i = 0;
while ($post = $DB_site->fetch_array($posts)) {
    // Remove the HTML
    $post['pagetext'] = strip_tags(unhtmlspecialchars($post['pagetext']));
    // Remove non-breaking spaces
    $post['pagetext'] = preg_replace("'&(nbsp|#160);'si", "", $post['pagetext']);
    // Write the post back
    $DB_site->query("UPDATE " . TABLE_PREFIX . "post SET pagetext='" . addslashes($post['pagetext']) . "' WHERE postid=" . $post['postid']);
    // Print a progress dot every 100 posts
    if ($i % 100 == 0) {
        echo ".";
        flush();
    }
    $i++;
}
$DB_site->query("UNLOCK TABLES");
echo "<br>Finished!";
?>
No warranties. Please note that this script will take quite a long time to finish (depending on the size of your board).
NTLDR
01-12-2004, 11:08 PM
Firstly, that script WILL time out if you run it from the browser. I'd also recommend selecting only the pagetext and postid from the post table and getting it to do only X posts per page.
If you have SSH access, then I'd recommend you close the board and run it from the command prompt.
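A minimal sketch of the "X posts per page" idea, under stated assumptions: clean_pagetext() and batch_query() are hypothetical names invented here, htmlspecialchars_decode() stands in for vBulletin's unhtmlspecialchars(), and the actual database calls and page redirect are left out so the logic can be read on its own.

```php
<?php
// Same cleaning steps as the script above, wrapped in a function.
// htmlspecialchars_decode() stands in for vBulletin's unhtmlspecialchars().
function clean_pagetext(string $text): string
{
    $text = strip_tags(htmlspecialchars_decode($text));
    return preg_replace("'&(nbsp|#160);'si", '', $text);
}

// Hypothetical helper: SQL for one batch, resuming after the last postid handled.
// Only the two needed columns are selected, and LIMIT caps the work per page load.
function batch_query(string $prefix, int $lastid, int $perpage): string
{
    return "SELECT postid, pagetext FROM {$prefix}post"
         . " WHERE postid > $lastid ORDER BY postid LIMIT $perpage";
}

// Each page load would run one batch, then reload itself with the new $lastid.
echo batch_query('', 0, 500), "\n";
echo clean_pagetext('&lt;i&gt;old&lt;/i&gt; post&amp;nbsp;'), "\n";
```

Resuming from the last postid rather than using an offset keeps every batch equally cheap, and an interrupted run can pick up where it stopped.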
Xenon
01-12-2004, 11:15 PM
@Conq: hehe, i would have wondered already :)
good it's working now.
Also, I have to agree with Lee that the script will time out, and it's also not well coded (sorry to say).
Considering you have 97k posts, this script will run 97k queries...
It should be heavily optimized before you think of running it...
TosaInu
07-29-2004, 11:50 PM
It doesn't remove &nbsp; but it does delete the <bits>.
A batch script to strip HTML would be of great help (Ikonboard and 100,000's posts).
ConqSoft
07-29-2004, 11:52 PM
The new Import system for vb3 has an extra little utility included that can be used to strip whatever code/text you want. I used it to clean up my database. Worked great.
TosaInu
07-30-2004, 06:10 AM
Hello ConqSoft,
tools\cleaner.php?
ConqSoft
07-30-2004, 11:10 AM
Yes, tools\cleaner.php. I modified it a bit to call the strip_tags() function before it did the replacements. I used the replacements to replace the &nbsp; with blank, the &quot; with ", etc.
TosaInu
08-04-2004, 02:41 PM
Hello ConqSoft,
I've been .. not too smart, by replacing " with a blank and also stripped all exclamation marks. Also forgot the &
What does strip_tags() do? Is that the way to get rid of <html code align =""> that mess?
We still have a lot of that :(
Can you tell me how to modify the code please? I'm tempted to run it again and strip more.
The script timed out when max-exec was set to 600 seconds. It went ok at 1200.
How do you replace " ? It expects data between "", this would give """ ? Isn't that confusing the script? How do you replace say "center" by "left"? ""left""
Talking about parser. It's no longer possible to add say 3 spaces between words. This is a real pain for certain authors for good reasons. Is it possible to turn that off?
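On the question of replacing a double quote: one way around the confusion, sketched here, is to write the entries with single-quoted PHP strings, which can contain double quotes without escaping. The array shape follows the $replacer used with str_replace() in cleaner.php; the entries and the sample text are assumptions for illustration.

```php
<?php
// Single-quoted PHP strings may contain double quotes without escaping.
$replacer = array(
    '"center"' => '"left"',  // swap an attribute value, quotes included
    '&quot;'   => '"',       // turn the entity back into a real quote
    '&nbsp;'   => ' ',       // non-breaking space entity to a plain space
);

// Made-up sample text containing all three patterns.
$text = 'align="center" says&nbsp;&quot;hi&quot;';
$result = str_replace(array_keys($replacer), $replacer, $text);

echo $result, "\n";
```

Inside a double-quoted string the same character would have to be written as \" instead.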
TosaInu
08-06-2004, 09:50 PM
Hello,
How should the strip_tags() function be included in cleaner.php?
ConqSoft
08-06-2004, 10:03 PM
In the "Posts" section, add
$text = strip_tags($text);
right ABOVE
$text = str_replace(array_keys($replacer), $replacer, $post['pagetext']);
TosaInu
08-06-2004, 11:06 PM
Thanks, I'm going to run it again.
TosaInu
08-07-2004, 10:21 PM
Hello ConqSoft,
That didn't work for me?
This did:
$post['pagetext'] = strip_tags($post['pagetext']);
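A likely reason the first suggestion had no effect is that the replacement line builds $text from $post['pagetext'], so anything done to $text beforehand is overwritten. The sketch below reproduces that ordering outside cleaner.php; the sample post and the initial empty $text are assumptions, since the surrounding code of the tool is not shown in this thread.

```php
<?php
// Made-up sample post and a one-entry replacer.
$post = array('pagetext' => '<p>Hello&nbsp;there</p>');
$replacer = array('&nbsp;' => ' ');

// Stripping $text first has no lasting effect: the next line overwrites $text
// with a value built from the untouched $post['pagetext'].
$text = '';
$text = strip_tags($text);
$text = str_replace(array_keys($replacer), $replacer, $post['pagetext']);
$before = $text;

// Stripping $post['pagetext'] first means the replacement line reads cleaned input.
$post['pagetext'] = strip_tags($post['pagetext']);
$text = str_replace(array_keys($replacer), $replacer, $post['pagetext']);
$after = $text;

echo $before, "\n", $after, "\n";
```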
ConqSoft
05-29-2006, 06:07 PM
Never did update this thread... The cleaner.php tool from Jelsoft did the trick.
Pure Dope
02-14-2007, 01:26 AM
is there a way to remove ONLY certain HTML?
like......the REDIRECT META TAGS?
is that easy? I don't even need to remove it. I would like it to stay... maybe just handicap it so it doesn't work!
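One possible approach, sketched with two hypothetical helpers (neither is part of vBulletin or cleaner.php): a regular expression that targets only meta refresh tags, used either to delete them or to entity-encode them so they stay visible as text but no longer act as tags. The pattern is an assumption about how such tags are usually written and may miss unusual variants.

```php
<?php
// Hypothetical helper: removes only <meta http-equiv="refresh" ...> tags.
function strip_meta_refresh(string $html): string
{
    return preg_replace('#<meta\b[^>]*http-equiv\s*=\s*["\']?refresh\b[^>]*>#i', '', $html);
}

// Hypothetical helper: keeps the tag visible as text but stops browsers acting on it.
function disarm_meta_refresh(string $html): string
{
    return preg_replace_callback(
        '#<meta\b[^>]*http-equiv\s*=\s*["\']?refresh\b[^>]*>#i',
        function ($m) {
            // Encode < and > so the tag is shown, not interpreted.
            return htmlspecialchars($m[0]);
        },
        $html
    );
}

$sample = '<b>x</b><meta http-equiv="refresh" content="0;url=http://example.com/">';
echo strip_meta_refresh($sample), "\n";
echo disarm_meta_refresh($sample), "\n";
```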