0
votes

I'm setting up Sphinx to search a database a lot like what I imagine is at the heart of Stack Overflow -- a posts table, and a tags table. Each tag has a post_id to identify the post with which it's associated. The site is built around CakePHP 2.2.3.

What I want is to be able to search for topics. Maybe the topic name occurs in the text of the post, or maybe in the tags associated with it (sometimes users are lazy about tagging).

My naive implementation (first time using Sphinx, and first time getting my hands dirty with MySQL) has been to set up Sphinx with a query like the below and search the associated index:

SELECT posts.id, 
       posts.text,
       tags_concat.all_tags 
FROM   posts 
       LEFT JOIN(SELECT tags.post_id, 
                        GROUP_CONCAT(tags.name) AS all_tags 
                 FROM   tags 
                 GROUP  BY tags.post_id) AS tags_concat 
              ON tags_concat.post_id = posts.id 
WHERE  posts.id >= $start 
       AND posts.id <= $end 

I've used what seems to be the ubiquitous Sphinx behaviour to do this, and the results are awesome! However, the database is big enough that I'll need to use a delta index to keep reasonably up-to-date search results.

Most of the delta schemes I've seen might (for example) track the highest posts.id in the main index, and place anything above that in the delta index. If a user edits a post, or adds or edits a tag, however, it could be a week before that's reflected in either index, since it won't result in a new row in "posts". Unlike Stack Overflow, much of my tagging will occur after a post is made, and posts will absolutely be edited after they're created (they're built by the community), so this is a significant issue.

Am I missing something in the implementation of the delta index, or an alternative to delta indexes, or a tweak to the implementation that might make searches more reliable?

Edit

Thinking on this some more, I realize that the delta index should still be updated if the post or tags are edited after post creation, with one exception -- posts created before the main index is rebuilt but edited afterward (even if there's only an hour between those two events) won't show in the delta. Can I flag posts when they're changed (with a posts.in_delta column, for example), then somehow have the results from the delta index override those from the main index?

2

2 Answers

0
votes

My prefered solution is to use a delta ids table, somthing like, delta config...

sql_query_pre = SELECT @max_id:=max_id, @updated:=updated FROM counter
sql_query_pre = CREATE TEMPORARY TABLE delta_ids (id INT UNSIGNED NOT NULL PRIMARY) \
    SELECT id FROM posts WHERE updated > @updated AND id <= @max_id
sql_query_pre = INSERT IGNORE INTO delta_ids \
    SELECT DISTINCT post_id FROM tags WHERE updated > @updated AND post_id <= @max_id


sql_query = SELECT ... WHERE post.id IN(SELECT id FROM delta_ids) OR post_id > @max_id

sql_query_killlist = SELECT id FROM delta_ids

... which gives a nice list of ids to use in a kill list :)

Where the main has a pretty standard counter table, although it needs a timestamp as well as last post id.

sql_query_pre = REPLACE INTO counter SELECT MAX(id) AS max_id, MAX(updated) AS updated FROM post

sql_qyery_range = SELECT 1, max_id FROM counter

I find it most reliable to put the counter in a _pre query (rather than post) so that the main and be clearly filtered by exact same criteria.

(all typed from memory, might want to experiment to get the queries as efficient as possible. )

0
votes

Considering the following, and thought I'd leave it here for comment or as a potential answer for others. Adding this to my main source's definition:

sql_query_pre = SET NAMES utf8
sql_query_pre = UPDATE posts SET in_delta = 0
sql_query_pre = UPDATE posts SET in_delta = 1 WHERE posts.created > DATE_SUB(NOW(), INTERVAL 7 DAY)

where posts.created is a DATETIME set at creation, and posts.in_delta has been added for this. Have also added "WHERE posts.in_delta = 0" to the sql_query for the main index, and "WHERE posts.in_delta = 1" to that for the delta index (but of course, only the first of the sql_query_pre lines there).

My rationale is to give posts a "cooling down" period of 7 days, in this case, to be edited, re-tagged, and so on. With the above, my expectation is that the indices will mostly reflect changes, even if a user is unlucky enough to create a post immediately before the main index is updated.