8
votes

I am implementing a web application similar to Twitter. I need to implement a 'retweet' action, and one tweet can be retweeted by the same person multiple times.

I have a basic 'tweets' table that has these columns:

Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id

(where tweet_id is the primary key for tweets, tweet_text contains the tweet text, tweet_date_created is the DateTime when the tweet was created, and tweet_user_id is a foreign key to the users table that identifies the user who created the tweet)

Now I am wondering how I should implement the retweet action in my database.

Option 1

Should I create a new join table, which would look like this:

Retweets: tweet_id | user_id | retweet_date_retweeted

(Where tweet_id is a foreign key to the tweets table, user_id is a foreign key to the users table and identifies the user who retweeted the tweet, and retweet_date_retweeted is a DateTime that records when the retweet was made.)

pros: There will be no empty columns. When a user retweets, a new row is created in the retweets table.

cons: Querying will be more difficult. It will need to join two tables and somehow sort the tweets by two dates (when a tweet is not a retweet, sort it by tweet_date_created; when it is a retweet, sort it by retweet_date_retweeted).
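
For what it's worth, the two-date sort in Option 1 may be less awkward than it sounds. Below is a minimal sketch using Python's built-in sqlite3 with an in-memory database as a stand-in for the real one. The table and column names come from the question; the sample rows and the `sorted_at` alias are assumptions made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the real database; names follow the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweets (
    tweet_id INTEGER PRIMARY KEY,
    tweet_text TEXT NOT NULL,
    tweet_date_created TEXT NOT NULL,
    tweet_user_id INTEGER NOT NULL
);
CREATE TABLE retweets (
    tweet_id INTEGER NOT NULL REFERENCES tweets(tweet_id),
    user_id INTEGER NOT NULL,
    retweet_date_retweeted TEXT NOT NULL
);
-- Invented sample data: user 1 writes one tweet and retweets user 2's tweet.
INSERT INTO tweets VALUES (1, 'hello', '2012-07-27 00:04:30', 1);
INSERT INTO tweets VALUES (2, 'world', '2012-07-27 00:04:47', 2);
INSERT INTO retweets VALUES (2, 1, '2012-07-27 00:07:11');
""")

# Each branch exposes one sort key (hypothetical alias: sorted_at), so the
# final ORDER BY does not need to know which table a row came from.
timeline_sql = """
SELECT t.tweet_id, t.tweet_text, t.tweet_date_created AS sorted_at
FROM tweets AS t
WHERE t.tweet_user_id = :user_id
UNION ALL
SELECT t.tweet_id, t.tweet_text, rt.retweet_date_retweeted AS sorted_at
FROM tweets AS t
JOIN retweets AS rt ON rt.tweet_id = t.tweet_id
WHERE rt.user_id = :user_id
ORDER BY sorted_at DESC
"""
timeline = conn.execute(timeline_sql, {"user_id": 1}).fetchall()
print(timeline)
```

The retweet of tweet 2 sorts ahead of user 1's own tweet because it carries the retweet time rather than the original creation time.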

Option 2

Or should I implement it in the tweets table as parent_id, so that it looks like this:

Tweets: tweet_id | tweet_text | tweet_date_created | tweet_user_id | parent_id

(Where all the columns remain the same and parent_id is a foreign key to the same tweets table. When a tweet is created, parent_id stays empty. When a tweet is retweeted, parent_id contains the original tweet's id, tweet_user_id contains the user who retweeted, tweet_date_created contains the DateTime when the retweet was made, and tweet_text stays empty, because we will not let users change the original tweet when retweeting.)

pros: Querying is much more elegant, as I do not have to join two tables.

cons: There will be an empty tweet_text cell every time a tweet is retweeted. So if I have 1 000 tweets in my database and each of them is retweeted 5 times, there will be 5 000 rows with empty text in my tweets table.
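
One thing worth checking with Option 2: because a retweet row has no text of its own, showing a timeline still needs a lookup of the parent row. Here is a rough sketch of that, again with Python's sqlite3 and an in-memory database. The sample rows and the `shown_text` alias are invented for illustration, and the self-join is only one possible way to fetch the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweets (
    tweet_id INTEGER PRIMARY KEY,
    tweet_text TEXT,                 -- empty (NULL) for retweets
    tweet_date_created TEXT NOT NULL,
    tweet_user_id INTEGER NOT NULL,
    parent_id INTEGER REFERENCES tweets(tweet_id)
);
-- Invented sample data: an original tweet by user 2 ...
INSERT INTO tweets VALUES (1, 'hello', '2012-07-27 00:04:30', 2, NULL);
-- ... and a retweet of it by user 1, which carries no text of its own.
INSERT INTO tweets VALUES (2, NULL, '2012-07-27 00:07:11', 1, 1);
""")

# Sorting uses a single column, but the text has to come from the parent row,
# so a self-join (or a second query) is still needed to display retweets.
rows = conn.execute("""
SELECT t.tweet_id,
       COALESCE(t.tweet_text, p.tweet_text) AS shown_text,
       t.tweet_date_created
FROM tweets AS t
LEFT JOIN tweets AS p ON p.tweet_id = t.parent_id
ORDER BY t.tweet_date_created DESC
""").fetchall()
print(rows)
```

So the "no join" advantage holds for sorting, but probably not for rendering the text.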


Which is the more efficient way? Is it better to have empty cells or to have cleaner queries?


2 Answers

9
votes

IMO option #1 would be better. The query to join the tweet and retweet tables would not be at all complex and could be done via a left or inner join, depending on whether you want to show all tweets or only tweets which were retweeted. The join should also perform well: the retweet table is narrow, the columns being joined are ints, and they will each have indices due to the FK constraints (with MySQL's InnoDB engine, at least).

Another recommendation: do not prefix all your columns with tweet or retweet. That can be inferred from the table in which the data is stored, for example:

tweet
    id
    user_id
    text
    created_at

retweet
    tweet_id
    user_id
    created_at

And sample joins:

# Return all tweets which have been retweeted
SELECT
    count(*),
    t.id
FROM
    tweet AS t
INNER JOIN retweet AS rt ON rt.tweet_id = t.id
GROUP BY
    t.id;

# Return tweet and possible retweet data for a specific tweet
SELECT
    t.id,
    t.text,
    rt.user_id,
    rt.created_at
FROM
    tweet AS t
LEFT OUTER JOIN retweet AS rt ON rt.tweet_id = t.id
WHERE
    t.id = :tweetId;
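
To make the inner-versus-left distinction concrete, here is a small sketch that runs both kinds of join against throwaway data. It uses Python's sqlite3 with an in-memory database rather than MySQL, and the rows are made up. Note that the left-join variant counts `rt.tweet_id` instead of `*`, so tweets without retweets report 0 rather than 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (id INTEGER PRIMARY KEY, user_id INTEGER, text TEXT, created_at TEXT);
CREATE TABLE retweet (tweet_id INTEGER, user_id INTEGER, created_at TEXT);
-- Invented sample data: tweet 1 is never retweeted, tweet 2 is retweeted twice.
INSERT INTO tweet VALUES (1, 1, 'never retweeted', '2012-07-27 00:04:30');
INSERT INTO tweet VALUES (2, 2, 'retweeted twice', '2012-07-27 00:04:47');
INSERT INTO retweet VALUES (2, 1, '2012-07-27 00:06:37');
INSERT INTO retweet VALUES (2, 3, '2012-07-27 00:07:11');
""")

# INNER JOIN: only tweets that have at least one retweet appear.
retweeted_only = conn.execute("""
SELECT t.id, COUNT(*)
FROM tweet AS t
INNER JOIN retweet AS rt ON rt.tweet_id = t.id
GROUP BY t.id
""").fetchall()

# LEFT JOIN: every tweet appears; COUNT(rt.tweet_id) skips the NULL filler row.
all_tweets = conn.execute("""
SELECT t.id, COUNT(rt.tweet_id)
FROM tweet AS t
LEFT JOIN retweet AS rt ON rt.tweet_id = t.id
GROUP BY t.id
ORDER BY t.id
""").fetchall()

print(retweeted_only)
print(all_tweets)
```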

-- Update per request --

The following is for demonstration only, to show why I would opt for option #1. There are no foreign keys and no indices; you will have to add those yourself. But the results should demonstrate that the joins won't be too painful.

CREATE TABLE `tweet` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `user_id` int(10) unsigned NOT NULL,
    `value` varchar(255) NOT NULL,
    `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=8 DEFAULT CHARSET=utf8;

CREATE TABLE `retweet` (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `tweet_id` int(10) unsigned NOT NULL,
    `user_id` int(10) unsigned NOT NULL,
    `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=3 DEFAULT CHARSET=utf8;
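
As a rough idea of what "add these yourself" could look like, here is a sketch with a foreign key and a few indices, written against SQLite through Python's sqlite3 so it can be run as-is. The index names are hypothetical and the choice of indexed columns is an assumption based on the per-user timeline queries in this answer. In MySQL the tables would also need ENGINE=InnoDB for the foreign key to be enforced, since MyISAM parses FOREIGN KEY clauses but ignores them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE tweet (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    value TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE retweet (
    id INTEGER PRIMARY KEY,
    tweet_id INTEGER NOT NULL REFERENCES tweet(id),
    user_id INTEGER NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Hypothetical index names; they cover the per-user timeline lookups.
CREATE INDEX idx_tweet_user_created ON tweet (user_id, created_at);
CREATE INDEX idx_retweet_user_created ON retweet (user_id, created_at);
CREATE INDEX idx_retweet_tweet ON retweet (tweet_id);
""")

conn.execute("INSERT INTO tweet (user_id, value) VALUES (1, 'User1 | Tweet1')")

# A retweet that points at a tweet which does not exist should be refused.
try:
    conn.execute("INSERT INTO retweet (tweet_id, user_id) VALUES (999, 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Column 1 of PRAGMA index_list is the index name.
index_names = {row[1] for row in conn.execute("PRAGMA index_list('retweet')")}
print(orphan_rejected, sorted(index_names))
```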

# Sample Rows

mysql> select * from tweet;
+----+---------+----------------+---------------------+
| id | user_id | value          | created_at          |
+----+---------+----------------+---------------------+
|  1 |       1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
|  2 |       1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  3 |       2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
|  4 |       3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
|  5 |       1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  6 |       1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  7 |       1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
+----+---------+----------------+---------------------+

mysql> select * from retweet;
+----+----------+---------+---------------------+
| id | tweet_id | user_id | created_at          |
+----+----------+---------+---------------------+
|  1 |        4 |       1 | 2012-07-27 00:06:37 |
|  2 |        3 |       1 | 2012-07-27 00:07:11 |
+----+----------+---------+---------------------+

# Query to pull all tweets for user_id = 1, including retweets, ordered from newest to oldest

select * from (
    select t.* from tweet as t where user_id = 1
    union
    select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1))
a order by created_at desc;

mysql> select * from (select t.* from tweet as t where user_id = 1 union select t.* from tweet as t where t.id in (select tweet_id from retweet where user_id = 1)) a order by created_at desc;
+----+---------+----------------+---------------------+
| id | user_id | value          | created_at          |
+----+---------+----------------+---------------------+
|  7 |       1 | User1 | Tweet5 | 2012-07-27 00:06:54 |
|  6 |       1 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  5 |       1 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  4 |       3 | User3 | Tweet1 | 2012-07-27 00:04:58 |
|  3 |       2 | User2 | Tweet1 | 2012-07-27 00:04:47 |
|  2 |       1 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  1 |       1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+---------+----------------+---------------------+

Notice in the last set of results that the retweets are included too, and that the retweet of #4 is displayed before the retweet of #3 (the rows are ordered by each original tweet's created_at).

-- Update --

You can accomplish what you are asking for, sorting retweets by the time they were retweeted rather than by when the original was created, by changing the query a bit:

select * from (
    select t.id, t.value, t.created_at from tweet as t where user_id = 1
    union
    select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1)
a order by created_at desc;

mysql> select * from (select t.id, t.value, t.created_at from tweet as t where user_id = 1 union select t.id, t.value, rt.created_at from tweet as t inner join retweet as rt on rt.tweet_id = t.id where rt.user_id = 1) a order by created_at desc;
+----+----------------+---------------------+
| id | value          | created_at          |
+----+----------------+---------------------+
|  3 | User2 | Tweet1 | 2012-07-27 00:07:11 |
|  7 | User1 | Tweet5 | 2012-07-27 00:06:54 |
|  6 | User1 | Tweet4 | 2012-07-27 00:06:50 |
|  5 | User1 | Tweet3 | 2012-07-27 00:06:47 |
|  4 | User3 | Tweet1 | 2012-07-27 00:06:37 |
|  2 | User1 | Tweet2 | 2012-07-27 00:04:35 |
|  1 | User1 | Tweet1 | 2012-07-27 00:04:30 |
+----+----------------+---------------------+
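
One caveat, given that the question allows the same person to retweet a tweet several times: plain UNION removes duplicate rows, so two retweets of the same tweet by the same user with the same timestamp would collapse into one line. A small sketch of the difference, using Python's sqlite3 with invented rows (the `timeline` helper is made up for this example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tweet (id INTEGER PRIMARY KEY, user_id INTEGER, value TEXT, created_at TEXT);
CREATE TABLE retweet (id INTEGER PRIMARY KEY, tweet_id INTEGER, user_id INTEGER, created_at TEXT);
INSERT INTO tweet VALUES (3, 2, 'User2 | Tweet1', '2012-07-27 00:04:47');
-- Invented case: user 1 retweets tweet 3 twice within the same second.
INSERT INTO retweet VALUES (1, 3, 1, '2012-07-27 00:07:11');
INSERT INTO retweet VALUES (2, 3, 1, '2012-07-27 00:07:11');
""")

def timeline(set_operator):
    """Run the timeline query for user 1 with UNION or UNION ALL."""
    sql = f"""
    SELECT t.id, t.value, t.created_at FROM tweet AS t WHERE t.user_id = 1
    {set_operator}
    SELECT t.id, t.value, rt.created_at FROM tweet AS t
    JOIN retweet AS rt ON rt.tweet_id = t.id WHERE rt.user_id = 1
    ORDER BY created_at DESC
    """
    return conn.execute(sql).fetchall()

deduplicated = timeline("UNION")      # identical rows are merged
everything = timeline("UNION ALL")    # every retweet is kept
print(len(deduplicated), len(everything))
```

If every retweet should show up, UNION ALL is probably the safer choice, and it also spares the database the de-duplication step.
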
1
votes

I would choose option 2 with a slight modification. The parent_id column in the tweets table should point to the row itself if the row is not a retweet. Then the querying will be extremely easy:

SELECT tm.Id, tm.UserId, tc.Text, tm.Created,
    CASE WHEN tm.Id <> tc.Id THEN tc.UserId ELSE NULL END AS OriginalAsker
FROM tweet tm
LEFT JOIN tweet tc ON tm.ParentId = tc.Id
ORDER BY tm.Created DESC;

(tc is the parent table, the one with the content: it has the tweet's text, the original poster's Id, etc.)

The reason for the rule about pointing to itself when not a retweet is that it then becomes easy to add more joins to the original tweet. You just join a table with tc and don't care whether the row is a retweet or not.

Not only is the query easy, but it should also perform better than option 1, because sorting is done on a single physical column, which can be indexed.

The only drawback is that the DB will be a little bit larger.
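
A minimal sketch of the self-pointing rule, using Python's sqlite3 with an in-memory database. The `post_tweet` and `retweet` helpers are hypothetical and the rows are invented. It assumes ParentId has to be nullable for a moment, because with an auto-generated Id the row must exist before it can point at itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE tweet (
    Id INTEGER PRIMARY KEY,
    UserId INTEGER NOT NULL,
    Text TEXT,
    Created TEXT NOT NULL,
    ParentId INTEGER REFERENCES tweet(Id)
)""")

def post_tweet(user_id, text, created):
    """Insert an original tweet, then point ParentId at the row itself."""
    cur = conn.execute(
        "INSERT INTO tweet (UserId, Text, Created) VALUES (?, ?, ?)",
        (user_id, text, created))
    conn.execute("UPDATE tweet SET ParentId = Id WHERE Id = ?", (cur.lastrowid,))
    return cur.lastrowid

def retweet(user_id, parent_id, created):
    """Insert a retweet: no text of its own, ParentId names the original."""
    cur = conn.execute(
        "INSERT INTO tweet (UserId, Created, ParentId) VALUES (?, ?, ?)",
        (user_id, created, parent_id))
    return cur.lastrowid

original = post_tweet(2, "User2 | Tweet1", "2012-07-27 00:04:47")
copy = retweet(1, original, "2012-07-27 00:07:11")

# Every row has a parent, so a plain join finds the content row for both
# originals and retweets without any special casing.
feed = conn.execute("""
SELECT tm.Id, tm.UserId, tc.Text, tm.Created
FROM tweet AS tm
JOIN tweet AS tc ON tc.Id = tm.ParentId
ORDER BY tm.Created DESC
""").fetchall()
print(feed)
```

The extra UPDATE on every original tweet is the price of the rule; whether that is acceptable depends on the write volume.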