Yesterday, when I tried the newsgroup again from scratch, I did not have any better success than before, contrary to what I had thought. What happened was that I had the "days to keep orphaned replies" setting at 7 days, so it deleted the orphans that were older than 7 days. So today I tried it with a setting longer than the age of any message in the newsgroup, and there were 69 messages that could not be inserted into the forums (and 468 that were).
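To make the effect of that setting concrete, here is a minimal sketch in Python (not the script's own Perl) of how an age cutoff splits orphaned replies into kept and purged. The function name, the tuple layout, and the dates are all hypothetical and are not taken from newnews.pl.

```python
from datetime import datetime, timedelta

def split_orphans(orphans, now, keep_days):
    """Split (msgid, posted_at) pairs into (kept, purged) by age.

    Hypothetical helper: anything older than keep_days is purged,
    which is how a short setting can discard replies that could
    still have been threaded later.
    """
    cutoff = now - timedelta(days=keep_days)
    kept = [o for o in orphans if o[1] >= cutoff]
    purged = [o for o in orphans if o[1] < cutoff]
    return kept, purged

# Made-up sample data for illustration only.
now = datetime(2001, 4, 1)
orphans = [
    ("<recent@example.invalid>", datetime(2001, 3, 30)),
    ("<old@example.invalid>", datetime(2001, 3, 1)),
]
kept_short, purged_short = split_orphans(orphans, now, keep_days=7)
kept_long, purged_long = split_orphans(orphans, now, keep_days=365)
```

With the 7-day value the older orphan is purged before it gets a chance to be matched; with a value longer than the oldest message, nothing is purged.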
Anyways, looking at the ones that did not get inserted, I had these cases:
- A thread-starting post had two direct replies, and both had their refs set to the msgid of the first post in the thread. However, the one that was not inserted had an ord of 2, while the parent post had an ord of 0 and the other reply had an ord of 1. So it did not satisfy the condition that the ref match and the ord be equal to or one more than that of the parent post, and it didn't get inserted. So I changed the MySQL call in the newnews.pl file to ignore the ord: on line 404, I removed "((d.ord = a.ord) OR (d.ord +1 = a.ord))" so that I now have:
Code:
my $q3 = db_fetch("SELECT b.title, a.nntpposter, a.forum, a.msgid, a.dtm, a.subject, a.poster, a.body, a.ord, a.postid, a.email, b.threadid, c.ref FROM usenet_article AS a, thread AS b, usenet_ref AS c, post AS d where b.threadid = d.threadid and b.forumid = $$newsgroup->{forumid} and c.ref = d.msgid and a.msgid = c.msgid ORDER BY a.dtm, a.ord");
With that change, this post was inserted into the proper thread, along with 4 others that had the same problem, which brought the total left in usenet_article down to 64.
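The effect of the ord condition can be sketched outside the database. The following is a rough Python illustration (not the script's Perl) of the matching rule described above: a reply matches a parent when its ref equals the parent's msgid and, optionally, when its ord equals the parent's ord or is one more. The class names, the find_parent function, and the msgids are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Post:
    msgid: str
    ord: int

@dataclass
class Article:
    msgid: str
    ref: str
    ord: int

def find_parent(article, posts, check_ord=True):
    """Return the first post whose msgid equals the article's ref.

    With check_ord=True this also applies the removed condition
    ((d.ord = a.ord) OR (d.ord + 1 = a.ord)), where d is the post
    and a is the article.
    """
    for post in posts:
        if post.msgid != article.ref:
            continue
        if check_ord and article.ord not in (post.ord, post.ord + 1):
            continue
        return post
    return None

# Made-up thread: one root post and two direct replies to it.
root = Post("<root@example.invalid>", ord=0)
reply1 = Article("<r1@example.invalid>", ref=root.msgid, ord=1)
reply2 = Article("<r2@example.invalid>", ref=root.msgid, ord=2)

strict = [find_parent(a, [root]) for a in (reply1, reply2)]
relaxed = [find_parent(a, [root], check_ord=False) for a in (reply1, reply2)]
```

Under the strict rule the second reply finds no parent because its ord is two more than the root's; once the ord check is dropped, both replies match on the ref alone.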
The remaining cases are the ones still left over with the ord part removed from the MySQL query.
- A message has two refs in it and both those refs are associated with posts in the post table, however, these referenced posts are in different threads. Also, the original message on the newsgroup does not have any references associated with it. The refs associated with this message both had the same title "(no subject)". So the orphaned replies code must have associated this post with both of these messages.
- A reply to a message had a different title than its reference, and the post that was not inserted was a reply to that message that had no refs in the headers. The original post was from a newbie to newsgroups who replied to a message in order to start a new thread.
- One post was a reply to a post, but the reference for that post did not match the parent post (the message did have quotes from the parent post though).
- Some posts have weird subject lines associated with them; for example, one person's reply to a message titled "My Post" ends up being "Re: [My Post]", and a reply to another reply ends up being "Re: [Re: My Post]". These posts do not have any references in them on the newsgroup. Also, there were instances of subjects like "Re(2): My Post" and "Re: Re(2):".
- Some posts to the mailing list do not make it to the newsgroup, so the replies to them get trapped in the usenet_article table.
- Some posts looked like they matched all the conditions to be inserted, with the refs being associated with the post(s) that they go to. I went through the MySQL query and everything matched. So I added some console() calls to see if those messages did get selected, and they did. So I added some more console() statements to determine what was happening; starting at line 440, I put in:
Code:
console("\nTrying post from $poster:");
# Insert the reply; if the insert fails, the DBI error is printed in the else branch.
if (db_execute("INSERT INTO post (allowsmilie,threadid,username,dateline,pagetext,visible,ord,msgid,userid,ipaddress,isusenetpost,seq) VALUES ($config{allowsmilies},$threadid,$poster,$dtm,$fbody,'1',$ord,$msgid,$userid,$nntpposter,1,$seq+1)",1)) {
    console(" posted!");
    $postid = $dbh->{'mysql_insertid'};
    # The article is now in the forum, so remove it from the holding tables.
    db_execute("DELETE FROM usenet_article WHERE msgid = $msgid");
    db_execute("DELETE FROM usenet_ref WHERE msgid = $msgid");
    # Update the thread's reply count and, if this post is the newest, its last-post info.
    my $q4 = db_fetch("SELECT lastpost FROM thread WHERE threadid=$threadid");
    my ($lastpost) = $q4->fetchrow_array;
    if (!$lastpost) { $lastpost = $dtm; }
    db_execute("UPDATE thread SET replycount = $seq ".(($dtm >= $lastpost)?",lastpost=$dtm,lastposter=$poster":"")." WHERE threadid=$threadid");
    # Do the same for the forum.
    my $q5 = db_fetch("SELECT lastpost FROM forum WHERE forumid=$forumid");
    $lastpost = $q5->fetchrow_array;
    if (!$lastpost) { $lastpost = $dtm; }
    db_execute("UPDATE forum SET replycount=replycount + 1 ".(($dtm >= $lastpost)?",lastpost=$dtm,lastposter=$poster":"")." WHERE forumid=$forumid");
    indexpost($postid);
    push(@updated_threads,$threadid);
}
else { console($DBI::errstr); }
This produced the following output:
Getting article batch from rec.sport.unicycling
No new messages in rec.sport.unicycling
inserting new threads into forums
inserting replies into forums
Trying post from 'Jonathan Marsha':Duplicate entry '<CSujlIAW5Mw6Eww0@jbmarshl.demon.co.uk>' for key 5
Trying post from 'Mark Wiggins':Duplicate entry '<3AC0D05A.620D0A71@ftel.co.uk>' for key 5
Trying post from 'Mark Wiggins':Duplicate entry '<3AC0D05A.620D0A71@ftel.co.uk>' for key 5
Trying post from 'Mark Wiggins':Duplicate entry '<3AC0D05A.620D0A71@ftel.co.uk>' for key 5
Trying post from 'Chuck Webb':Duplicate entry '<As6w6.166$9d.54655@newshog.newsread.com>' for key 5
Trying post from 'Greg House':Duplicate entry '<5Xew6.151$pn3.483290@nntp3.onemain.com>' for key 5
Trying post from 'Greg House':Duplicate entry '<5Xew6.151$pn3.483290@nntp3.onemain.com>' for key 5Processing outgoing messages
Clean disconnection from news.tc.umn.edu
Um, OK, now I have figured that one out: those posts are already in the forum, and they somehow ended up in the usenet_article table many times.
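One way to avoid retrying such articles would be to skip any msgid that is already posted, or that has already been seen earlier in the same batch, before attempting the insert. The following Python sketch (not the script's Perl) shows the idea under the assumption that the already-posted msgids can be loaded into a set; the function name and sample data are hypothetical.

```python
def filter_new_articles(articles, posted_msgids):
    """Split articles into (fresh, duplicates) by msgid.

    An article is a duplicate if its msgid is already posted in the
    forum or appeared earlier in the same batch. Hypothetical helper,
    not part of newnews.pl.
    """
    seen = set(posted_msgids)
    fresh, duplicates = [], []
    for article in articles:
        msgid = article["msgid"]
        if msgid in seen:
            duplicates.append(article)
        else:
            seen.add(msgid)
            fresh.append(article)
    return fresh, duplicates

# Made-up sample data for illustration only.
posted = {"<already@example.invalid>"}
batch = [
    {"msgid": "<already@example.invalid>", "poster": "A"},
    {"msgid": "<new@example.invalid>", "poster": "B"},
    {"msgid": "<new@example.invalid>", "poster": "B"},
]
fresh, duplicates = filter_new_articles(batch, posted)
```

The duplicates list could then be used to clear the stale rows out of usenet_article instead of hitting the unique key on msgid every run.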
So, in conclusion, the rec.sport.unicycling newsgroup is pretty messed up as far as the posts being properly threaded. The newsgroup is also an email mailing list, which causes some pretty strange things with references (such as no references at all) and with the formatting of the subject line.
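For posts that arrive with no references, a fallback could be to match on a normalized subject. Here is a rough Python sketch (not the script's Perl) that strips the prefix styles seen above: "Re:", "Re(2):", and square brackets wrapped around the whole subject. The function name is hypothetical, and matching by subject is an assumption here, not something the script is known to do.

```python
import re

# Matches one leading "Re:" or "Re(n):" prefix, case-insensitively.
_RE_PREFIX = re.compile(r"^\s*re(\(\d+\))?\s*:\s*", re.IGNORECASE)

def normalize_subject(subject):
    """Repeatedly strip reply prefixes and enclosing [brackets]."""
    s = subject.strip()
    while True:
        before = s
        s = _RE_PREFIX.sub("", s, count=1)
        if s.startswith("[") and s.endswith("]"):
            s = s[1:-1].strip()
        if s == before:
            return s

subjects = ["Re: [My Post]", "Re: [Re: My Post]", "Re(2): My Post", "Re: Re(2):", "My Post"]
normalized = [normalize_subject(s) for s in subjects]
```

Note that this would also unwrap a subject that is legitimately written entirely in brackets, and that a subject consisting only of prefixes normalizes to an empty string, so both cases would need care before relying on it for threading.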