Hbase Schema design

Question

I have to design an Hbase table to store users information, this information is targeted for social networking, like: age, sex, education, hobbies, read books, traveled countries ... NOTE: we could add more information in future, we dont know all information now.

for example: name: Olha, age: 25, sex: female, education: bachelor Information technology, education: master computer science, hobby: basket ball, hobby: ping pong, book: gone with the wind, book: Davinci code, language: english, language: french, Country: Germany

The main idea is to be able to do queries like: return all people who are female, age: 22 years old, speak: english, speak: french, read the book gone with the wind, like ping pong, like basket ball and German.

so you can add any criteria to the search query.

what is your suggestion about the HBASE table schema ( row key, column family ... ) that optimized this kind of search queries ( taking into consideration that we will add more information in future ) what is the best way to write such query ( scan, get, MapReduce ).

Thank you

I don't think HBase is a good choice for complex&dynamic queries. — ericson
For this kind of highly interconnected entities I would consider Graph databases like Neo4J or Titan. Depending on your requirements regarding replication, availability and maturity. — LMeyer
this is a kind of a research project, so i have to use Hbase. — user1027364

terrance.a.snyder terrance.a.snyder · Accepted Answer · 2013-06-09T16:48:45

I would agree with Ian Varley that Solr/Lucene and it's faceted queries and joins allow you to pivot the data in the way you want to see it - however - I also think your question might be a "counting" question or a "membership" question....

It sounds like you are after a list of people who match (N) attributes - the problem you have is that for each attribute you could have millions of user ids?

HBase is a good fit when all you are trying to do is compute intersection/union sizes.. Your key/value pairs can be put into Hbase, and you can "encode" the IDs of the users into either a Bloom Filter and HyperLogLog. Trading speed for accuracy and memory. Likely running map/reduce style jobs hourly/nightly on click-streams of log aggregation of some type.

Others have done this in the advertising space and online space for exactly the type of queries you are running ("find people who like red bull and pop-tarts that live in florida")

References

Contextual Advertising using Apache Hive and Amazon EMR http://aws.amazon.com/articles/2855

Scaling Distributed Counters: http://whynosql.com/scaling-distributed-counters/

Google: Sharding counters https://developers.google.com/appengine/articles/sharding_counters

Distributed Counter Performance in HBase - Part 1 http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1

Facebook's New Realtime Analytics System: HBase To Process 20 Billion Events Per Day http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html

Realtime Analytics with Hadoop and HBase - http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase

Log Event Processing with HBase http://tellapart.com/log-event-processing-with-hbase

Clickstream Analytics at BazaarVoice http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics

Realtime Analytics with HBase - http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version

Hbase Schema design

2 Answers