How to implement Twitter tweet, reply and retweet database schema without empty columns/cells

Question

How would you design a database for a twitter like app in the best and most performant way possible?

In this app a user can tweet, reply and retweet. Tweeting, replying and retweeting create, in the end, a new tweet. So we can easily create a schema like this:

tweets
  - id - integer
  - text - string -> this is non nullable
  - userId - integer
  - createdAt - date
  - parentId - integer -> when replying, this will be the parent tweet of the reply
  - isRetweet - boolean -> this becomes true only when the current tweet is a retweet. The parentId will have the id of the tweet that was retweeted

But let's introduce another feature: the users can retweet without writing a text(this is how Twitter currently works). In this case we have to make our text column nullable. Suddenly, this means we can create tweets and replies without text because there are no database constraints about this anymore (let's ignore the application layer validations for now).

My approach was to create 3 more tables for this. In the end we will have 4 tables:

tweets
  - id - integer
  - text - string
  - userId - integer
  - createdAt - date

replies
  - tweetId - integer -> this will contain the content of the reply
  - parentId - integer -> this is the parent of the reply

retweetsWithText
  - tweetId - integer -> this will contain the content of the retweet
  - parentId - integer -> this is the parent of the retweet

retweetsWithoutText
  - tweetId - integer -> this will contain the reference for the retweeted tweet
  - userId - integer -> who retweeted the tweet. We need this because we won't save anything in the `tweets` table regarding retweets without text
  - createdAt - date

I know this might be complex to query but in the end I think this will bring the best performance in terms of storage (no need for parentId and isRetweet). This will also fix the problem of retweets with no text. Another problem with the first approach is that parentId will be most likely null because, based on statistics, only 30% of tweets get comment or retweeted. So we will end up with a lot of NULL columns for parentId

Is there a simpler approach for this problem with less tables and preferably with no empty/redundant cells?

amirouche amirouche · Accepted Answer · 2020-11-02T07:59:47

In your last schema with multiple tables: how do you plan to query the timeline of an user without querying multiple tables, then merge result, and eventually slice the result of then merge to allow pagination?

That is to say, even if the approach that rely on NULL might lead to used space on disk, it is also faster to query.

In database schema design, there is always a trade-off to be made between the speed of insertion and the speed of querying. In the case of a twitter-like application, speed of querying wins (which is the case most of the time in OLTP workload).

You need to benchmark both solutions, and prolly add a third option, if you use PostgreSQL, rely on table inheritance.

TIPS: I have no good knowledge of PostgreSQL on-file serialization, but if it is similar to FoundationDB tuple module, encoding a NULL value in a row's column can take only 1 byte.

How to implement Twitter tweet, reply and retweet database schema without empty columns/cells

1 Answers