
I am planning on using the following formula to calculate "trending" posts:

Trending Score = (p - 1) / (t + 2)^1.5

p = votes (points) from users. t = time since submission in hours.
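
For reference, the formula above can be written as a small function. This is only a sketch: the names `trending_score` and `rank_by_trending`, and the `(post_id, points, hours_old)` tuple layout, are made up for illustration.

```python
def trending_score(points, hours_since_submission):
    """Trending Score = (p - 1) / (t + 2)^1.5, with t in hours."""
    return (points - 1) / (hours_since_submission + 2) ** 1.5


def rank_by_trending(posts):
    """Sort (post_id, points, hours_old) tuples by score, highest first.

    The tuple layout is an assumption made for this example only.
    """
    return sorted(posts, key=lambda p: trending_score(p[1], p[2]), reverse=True)


if __name__ == "__main__":
    sample = [("old-popular", 200, 48), ("new-rising", 30, 1)]
    for post_id, points, hours_old in rank_by_trending(sample):
        print(post_id, round(trending_score(points, hours_old), 3))
```

Because t grows continuously, the score of every post changes over time even when nobody votes, so it cannot simply be stored once and used as a sort key.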

I am looking for advice on how to structure my database tables so that I can query for trending posts with DynamoDB (a NoSQL database service from Amazon).

DynamoDB requires a Primary Key for each item in a table. The Primary Key can consist of 2 parts: the Hash Attribute (string or number) and the Range Attribute (string or number). The Hash Attribute is required. The Range Attribute is optional: without it, the Hash Attribute must be unique for each item; with it, the combination of Hash and Range Attributes must be unique, and DynamoDB will build a sorted range index on the Range Attribute.

The structure I had in mind goes as follows:

TableName: Users

HashAttribute:  user_id
RangeAttribute: NONE
OtherFields: first_name, last_name

TableName: Posts

HashAttribute:  post_id
RangeAttribute: NONE
OtherFields: user_id,title, content, points, categories[ ]

TableName: Categories

HashAttribute:  category_name
RangeAttribute: post_id
OtherFields: title, content, points

TableName: Counters

HashAttribute:  counter_name
RangeAttribute: NONE
OtherFields: counter_value

Here is an example of the types of requests I would make with the table setup above (example: user_id=100):

User Action 1:

User creates a new post and tags the post for 2 categories (baseball,soccer)

Query (1):

Read the current value of the counter with counter_name='post_id', increment it by 1, and use the result as the new post_id

Query (2): Insert the following into the Posts table:

post_id=value_from_query_1, user_id=100, title=user_generated, content=user_generated, points=0, categories=['baseball','soccer']

Query (3):

Insert the following into the Categories table:

category_name='baseball', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0

Query (4):

Insert the following into the Categories table:

category_name='soccer', post_id=value_from_query_1, title=user_generated, content=user_generated, points=0
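
The four queries above can be sketched as follows. This is not DynamoDB code: the dictionaries are in-memory stand-ins for the Posts, Categories and Counters tables, and `create_post` is a hypothetical helper. In DynamoDB itself, Query (1) would presumably be a single atomic update (UpdateItem with an ADD action) rather than a separate read followed by a write.

```python
from itertools import count

# In-memory stand-ins for the tables (hypothetical; these are not DynamoDB calls).
posts = {}                    # post_id -> item
categories = {}               # (category_name, post_id) -> item
_post_id_counter = count(1)   # stands in for the Counters item 'post_id'


def create_post(user_id, title, content, category_names):
    """Run Query (1)-(4): allocate an id, write the post, write one row per category."""
    post_id = next(_post_id_counter)              # Query (1)
    posts[post_id] = {                            # Query (2)
        "post_id": post_id,
        "user_id": user_id,
        "title": title,
        "content": content,
        "points": 0,
        "categories": list(category_names),
    }
    for name in category_names:                   # Query (3), (4), ...
        categories[(name, post_id)] = {
            "category_name": name,
            "post_id": post_id,
            "title": title,
            "content": content,
            "points": 0,
        }
    return post_id


if __name__ == "__main__":
    new_id = create_post(100, "Opening day", "Who wins?", ["baseball", "soccer"])
    print(new_id, sorted(categories))
```

Note that points is duplicated: with this layout, every vote has to update the Posts item and one Categories item per tagged category.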



The end goal is to be able to conduct the following types of queries:

1. Query for trending posts

2. Query for posts in a certain category

3. Query for posts with the highest point values

Does anyone have any idea how I could structure my tables so that I could query for trending posts? Or is this something I give up the ability to do by switching to DynamoDB?

It's better to be specific about which database you're using. The various "NoSQL" databases are vastly different. - Matthew Flaschen
How often are you going to recalculate the trending posts? Where are you storing the timestamp above? Over what period of time are you willing to age out posts from being eligible for trending? - Nick
@Layble I was planning to use post_id as an incrementing counter (so sorting by post_id in descending order would show the latest posts). I was thinking of using post_id rather than a timestamp to avoid duplicate Range Attributes in the Categories table (e.g. if 2 different users posted about soccer at exactly the same time). I think I would want to recalculate the trending posts at least every minute. - Jason Pudzianowski

1 Answer


I'll start with a note on your comment about timestamps versus post_id.
Using a DynamoDB counter as your post_id generator is a scalability issue in itself: every new post has to go through the same counter item, so sequential numbers are inherently hard to scale, and you are better off using a timestamp. If you need to create posts at a very high rate, you can start by reading how Twitter does it: http://blog.twitter.com/2010/announcing-snowflake
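
To make the idea of time-based identifiers concrete, here is a simplified sketch in the spirit of Snowflake, not Twitter's actual implementation. It packs a millisecond timestamp, a worker number and a per-millisecond sequence into one integer, so IDs sort by creation time and two posts made in the same millisecond still get different IDs. The class name, the epoch value and the `clock` parameter are assumptions made for the example.

```python
import threading
import time


class TimeOrderedIdGenerator:
    """Simplified Snowflake-style IDs: ms timestamp | 10-bit worker | 12-bit sequence."""

    EPOCH_MS = 1_325_376_000_000  # arbitrary fixed instant in the past (assumption)

    def __init__(self, worker_id, clock=lambda: int(time.time() * 1000)):
        if not 0 <= worker_id < 1024:
            raise ValueError("worker_id must fit in 10 bits")
        self._worker_id = worker_id
        self._clock = clock
        self._lock = threading.Lock()
        self._last_ms = -1
        self._sequence = 0

    def next_id(self):
        with self._lock:
            # Never go backwards, even if the wall clock does.
            now = max(self._clock(), self._last_ms)
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # 4096 IDs already issued in this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return ((now - self.EPOCH_MS) << 22) | (self._worker_id << 12) | self._sequence


if __name__ == "__main__":
    gen = TimeOrderedIdGenerator(worker_id=1)
    print(gen.next_id(), gen.next_id())
```

Each application server would get its own worker_id, so no shared counter item is needed to hand out IDs.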

Now let's get back to your trending check:
I believe your scenario is misusing DynamoDB.
Let's say you have one HOT category that holds most of the posts. You would basically have to read every post in it (the data isn't spread well across hash keys), then look at each post's points and do the comparisons on your server. This either won't work or will be very expensive, because each run will probably use up all of your reserved read capacity units.

The DynamoDB approach for this type of trend calculation is MapReduce.
Read here how to implement it: http://aws.typepad.com/aws/2012/01/aws-howto-using-amazon-elastic-mapreduce-with-dynamodb.html

I can't give you a running time, but I believe you will find this approach scalable, though you won't be able to run it often.

On another note, you could keep a list of the "top 10/100" trending posts and update it in "real time" whenever a post is upvoted: fetch the list, check whether it needs to change given the newly upvoted post, and save it back to the database if so.
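
A minimal sketch of that last idea, assuming the list is stored as `(post_id, score)` pairs in a single item. The function name `update_top_list` and the list layout are made up for illustration, and the score could be either raw points or the trending score.

```python
def update_top_list(top, post_id, score, limit=10):
    """Return the top list after one post's score changes.

    `top` is a list of (post_id, score) pairs, highest score first.
    """
    entries = dict(top)
    entries[post_id] = score  # insert the post or refresh its score
    ranked = sorted(entries.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    stored = [("a", 5.0), ("b", 3.0)]
    updated = update_top_list(stored, "c", 4.0, limit=2)
    if updated != stored:
        # Only now would the list be written back to the database.
        print("save", updated)
```

Comparing the new list with the stored one means most upvotes cost one small read and no write. If the trending score is used, the stored scores still go stale as posts age, so the list would need a periodic refresh as well.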