Sphinx: WARNING: duplicate document ids found


FractalizeR
05-04-2010, 02:13 PM
The following is the output of cronjob /usr/local/sphinx/cron/delta.sh:

Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'ForumDelta'...
collected 43 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.0 Mhits, 100.0% done
total 43 docs, 5030 bytes
total 0.014 sec, 361480.03 bytes/sec, 3090.19 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'ThreadPostDelta'...
collected 2966 docs, 1.2 MB
collected 588 attr values
sorted 0.0 Mvalues, 100.0% done
sorted 0.1 Mhits, 84.8% done
WARNING: duplicate document ids found
total 2966 docs, 1159212 bytes
total 122.929 sec, 9429.94 bytes/sec, 24.13 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'DiscussionMessageDelta'...
collected 0 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
total 0 docs, 0 bytes
total 0.034 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'SocialGroupDelta'...
collected 0 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
total 0 docs, 0 bytes
total 0.010 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'VisitorMessageDelta'...
collected 0 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
total 0 docs, 0 bytes
total 0.014 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'BlogEntryDelta'...
collected 0 docs, 0.0 MB
collected 0 attr values
sorted 0.0 Mvalues, nan% done
total 0 docs, 0 bytes
total 0.046 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'BlogCommentDelta'...
collected 0 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
total 0 docs, 0 bytes
total 0.010 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).
Sphinx 0.9.8-id64-release (r1371)
Copyright (c) 2001-2008, Andrew Aksyonoff

using config file '/usr/local/sphinx/etc/vbulletin-sphinx.php'...
indexing index 'CMSArticlesDelta'...
collected 0 docs, 0.0 MB
collected 1 attr values
sorted 0.0 Mvalues, 100.0% done
total 0 docs, 0 bytes
total 0.011 sec, 0.00 bytes/sec, 0.00 docs/sec
rotating indices: succesfully sent SIGHUP to searchd (pid=11570).

Please look at ThreadPostDelta indexing:

The message "WARNING: duplicate document ids found" appears. Is that normal behavior for Sphinx? What is used as the document ID?

sung
05-04-2010, 03:41 PM
I got the warning as well (so glad it isn't just me), which I've reported in the vbulletin.com forums.

It can cause all sorts of nasty problems (http://www.sphinxsearch.com/docs/current.html#data-restrictions) with Sphinx.

There are a few different restrictions imposed on the source data which is going to be indexed by Sphinx, of which the single most important one is:

ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS (32-BIT OR 64-BIT, DEPENDING ON BUILD TIME SETTINGS).

If this requirement is not met, different bad things can happen. For instance, Sphinx can crash with an internal assertion while indexing; or produce strange results when searching due to conflicting IDs. Also, a 1000-pound gorilla might eventually come out of your display and start throwing barrels at you. You've been warned.

FractalizeR
05-04-2010, 08:42 PM
The configuration file builds the so-called document ID, which MUST be unique, from the following expression:

SELECT (c.contenttypeid << 32) | (p.postid) AS id

For some reason, it turns out to be non-unique. However, I don't see how that can happen unless the complete query really returns duplicate rows.
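
To illustrate that reasoning, here is a minimal Python sketch of the same packing, (contenttypeid << 32) | postid. The function names and the sample rows are hypothetical and are not taken from the vBulletin or Sphinx code. It assumes both values are non-negative and that postid fits in 32 bits.

```python
from collections import Counter


def make_doc_id(contenttypeid: int, postid: int) -> int:
    """Pack the two keys the way the quoted SELECT does."""
    return (contenttypeid << 32) | postid


def duplicate_ids(rows):
    """Return {doc_id: count} for every ID produced by more than one row."""
    counts = Counter(make_doc_id(ct, pid) for ct, pid in rows)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical (contenttypeid, postid) rows; the values are made up.
rows = [(1, 100), (1, 101), (2, 100)]

# Simulate a query whose join returns post 101 twice.
rows_with_repeat = rows + [(1, 101)]

print(duplicate_ids(rows))              # {}
print(duplicate_ids(rows_with_repeat))  # {4294967397: 2}
```

Under those assumptions the packing cannot collide for two different (contenttypeid, postid) pairs, so a duplicate ID would mean the query returned the same pair more than once, for example through a join that matches several rows per post.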

graham_w
06-19-2010, 11:58 PM
Did you ever sort this out? I'm noticing the same error.

Cheers

FractalizeR
06-20-2010, 06:49 AM
No, but it looks like it doesn't affect search quality.

graham_w
06-20-2010, 07:53 AM
Thanks for the reply - yeah I did find a thread saying similar on the sphinx website.

Cheers

JesterP
06-20-2010, 05:35 PM
Thanks for the reply - yeah I did find a thread saying similar on the sphinx website.

Cheers

I received this in my inbox this morning:

--->8---

### SAVE ORDERED IDS TO SEARCH CACHE ###;

MySQL Error : Duplicate entry '92f3f32f09b269797e91242ce55639a6-lastpost-DESC' for key 2
Error Number : 1062
Request Date : Sunday, June 20th 2010 @ 10:44:01 AM

---8<---
Everything is still running and I am not seeing anything bad happening. No errors since.