0 votes

I have to design an HBase table to store user information. This information is targeted at social networking, e.g. age, sex, education, hobbies, books read, countries traveled to, etc. NOTE: we could add more information in the future; we don't know all the attributes now.

For example: name: Olha, age: 25, sex: female, education: bachelor in information technology, education: master in computer science, hobby: basketball, hobby: ping pong, book: Gone with the Wind, book: The Da Vinci Code, language: English, language: French, country: Germany

The main idea is to be able to run queries like: return all people who are female, 22 years old, speak English and French, read the book Gone with the Wind, like ping pong and basketball, and have traveled to Germany.

So you should be able to add any criteria to the search query.

What is your suggestion for an HBase table schema (row key, column families, ...) that is optimized for this kind of search query, taking into consideration that we will add more information in the future? And what is the best way to execute such a query (Scan, Get, MapReduce)?

Thank you

I don't think HBase is a good choice for complex & dynamic queries. - ericson

For this kind of highly interconnected entity I would consider graph databases like Neo4j or Titan, depending on your requirements regarding replication, availability, and maturity. - LMeyer

This is a kind of research project, so I have to use HBase. - user1027364

2 Answers

1 vote

I would agree with Ian Varley that Solr/Lucene and its faceted queries and joins allow you to pivot the data the way you want to see it. However, I also think your question might really be a "counting" question or a "membership" question...

It sounds like you are after a list of people who match N attributes, and the problem is that for each attribute you could have millions of user ids.

HBase is a good fit when all you are trying to do is compute intersection/union sizes. Your key/value pairs can be put into HBase, and you can "encode" the IDs of the users into either a Bloom filter or a HyperLogLog, trading accuracy and memory for speed. You would likely run map/reduce-style jobs hourly or nightly over click-streams or log aggregations of some type.
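As a rough illustration of that pattern, here is a sketch using the stream-lib HyperLogLog implementation together with the HBase client API. The row-per-attribute-value layout, the "hll" family and "sketch" qualifier, and the precision parameter are my own assumptions for illustration, not a reference design:

    import com.clearspring.analytics.stream.cardinality.HyperLogLog;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AttributeSketches {
        private static final byte[] CF = Bytes.toBytes("hll");      // assumed family
        private static final byte[] QUAL = Bytes.toBytes("sketch"); // assumed qualifier

        // Fold one user id into the sketch for one attribute value, e.g.
        // offerUser(table, "age:22", "user:olha"). A real pipeline would do this in
        // batched map/reduce jobs rather than one read-modify-write per event.
        public static void offerUser(Table table, String attrValue, String userId) throws Exception {
            byte[] row = Bytes.toBytes(attrValue);
            Result r = table.get(new Get(row));
            HyperLogLog hll = r.isEmpty()
                    ? new HyperLogLog(14) // log2m=14 gives roughly 1% error; a tunable assumption
                    : HyperLogLog.Builder.build(r.getValue(CF, QUAL));
            hll.offer(userId);
            Put put = new Put(row);
            put.addColumn(CF, QUAL, hll.getBytes());
            table.put(put);
        }

        // Approximate |A ∩ B| by inclusion-exclusion: |A| + |B| - |A ∪ B|.
        public static long approxIntersection(HyperLogLog a, HyperLogLog b) throws Exception {
            long union = a.merge(b).cardinality();
            return a.cardinality() + b.cardinality() - union;
        }
    }

With sketches like these you never enumerate the millions of matching user ids; you only store a few kilobytes per attribute value and answer "how many" questions approximately.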

Others have done this in the advertising and online spaces for exactly the type of queries you are running ("find people who like Red Bull and Pop-Tarts and live in Florida").

References

Contextual Advertising using Apache Hive and Amazon EMR - http://aws.amazon.com/articles/2855

Scaling Distributed Counters - http://whynosql.com/scaling-distributed-counters/

Google: Sharding Counters - https://developers.google.com/appengine/articles/sharding_counters

Distributed Counter Performance in HBase, Part 1 - http://palominodb.com/blog/2012/08/24/distributed-counter-performance-hbase-part-1

Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day - http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html

Realtime Analytics with Hadoop and HBase - http://www.slideshare.net/larsgeorge/realtime-analytics-with-hadoop-and-hbase

Log Event Processing with HBase - http://tellapart.com/log-event-processing-with-hbase

Clickstream Analytics at BazaarVoice - http://www.slideshare.net/bazaarvoice_engineering/austin-scales-clickstream-analytics

Realtime Analytics with HBase - http://www.slideshare.net/alexbaranau/realtime-analytics-with-hbase-long-version

0 votes

This isn't a great use of HBase, in the sense that this is exactly the kind of thing that search indexes (like Lucene) are good for.

One normal schema for storing users and their information might look a lot like a relational database, in that you'd have one row per user and store all the attributes as columns and values (age=22, language=french, etc.). This works well for the extensibility you mention (you don't need to change any schema in order to store new attributes). With this schema, you could look up any one user (and all of their attributes) by the unique user id. That'd be blazingly fast to do, no matter how many users you have.
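For concreteness, here is a minimal sketch of that layout with the HBase Java client; the table name "users", the "info" family, and the row-key format are assumptions for illustration:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowPerUser {
        public static void main(String[] args) throws Exception {
            byte[] cf = Bytes.toBytes("info"); // one family holding every attribute
            try (Connection conn = ConnectionFactory.createConnection();
                 Table users = conn.getTable(TableName.valueOf("users"))) {

                // Write: attributes are plain qualifier=value pairs, so adding a
                // new attribute type later needs no schema change at all.
                Put put = new Put(Bytes.toBytes("user:olha"));
                put.addColumn(cf, Bytes.toBytes("age"), Bytes.toBytes("25"));
                put.addColumn(cf, Bytes.toBytes("sex"), Bytes.toBytes("female"));
                // Multi-valued attributes can live in the qualifier itself:
                put.addColumn(cf, Bytes.toBytes("language:english"), Bytes.toBytes("1"));
                put.addColumn(cf, Bytes.toBytes("language:french"), Bytes.toBytes("1"));
                users.put(put);

                // Read: a point lookup by user id is a single Get, fast at any scale.
                Result r = users.get(new Get(Bytes.toBytes("user:olha")));
                System.out.println(Bytes.toString(r.getValue(cf, Bytes.toBytes("age"))));
            }
        }
    }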

However, with that schema, if you want to search in the way you describe ("return all users whose age is 22"), every single query ends up being a scan of the entire table, because HBase only lets you access rows by their primary key; it has no secondary indexing of any kind. That will be extremely inefficient (picture having to scan a million rows every time you want to run a single query).
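To make the cost concrete, here is roughly what that query looks like with the plain client API, reusing the assumed names from the sketch above. The SingleColumnValueFilter is evaluated server-side and cuts down what comes back over the wire, but every row in the table is still read:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FullScanQuery {
        // "Return all users whose age is 22" without a secondary index.
        public static void printUsersAged22(Table users) throws Exception {
            Scan scan = new Scan();
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("info"),   // family from the row-per-user sketch above
                    Bytes.toBytes("age"),
                    CompareOp.EQUAL,
                    Bytes.toBytes("22")));
            try (ResultScanner scanner = users.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow())); // matching user ids
                }
            }
        }
    }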

How do you fix this? You could "reverse" the ordering of the data, putting the values in the row key and pointing to all the users with that value. For example, the row key could be "age:22", and the columns of that row could hold all the user ids of users who are age 22. This is problematic for a lot of reasons, not least of which is that updates become extremely expensive and tricky, but it would perform well for those specific queries.
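Here is a sketch of that inverted layout, again with assumed names (a separate index table with a single "u" family). Each criterion becomes one Get instead of a scan, and a multi-criteria query is a client-side set intersection; every attribute write on a user also has to update this table, which is where the expense and trickiness come from:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class InvertedIndex {
        private static final byte[] CF = Bytes.toBytes("u"); // assumed family for user ids

        // Index one user under one attribute value: row key = "attribute:value",
        // one qualifier per matching user id.
        public static void index(Table idx, String attrValue, String userId) throws Exception {
            Put put = new Put(Bytes.toBytes(attrValue));
            put.addColumn(CF, Bytes.toBytes(userId), Bytes.toBytes("1"));
            idx.put(put);
        }

        // One criterion = one Get, no table scan.
        public static Set<String> usersWith(Table idx, String attrValue) throws Exception {
            Result r = idx.get(new Get(Bytes.toBytes(attrValue)));
            Set<String> ids = new HashSet<>();
            if (!r.isEmpty()) {
                for (byte[] qualifier : r.getFamilyMap(CF).keySet()) {
                    ids.add(Bytes.toString(qualifier));
                }
            }
            return ids;
        }

        // Multi-criteria query = client-side intersection of the per-criterion sets,
        // e.g. usersWithAll(idx, "age:22", "language:french", "hobby:ping pong").
        public static Set<String> usersWithAll(Table idx, String... criteria) throws Exception {
            Set<String> acc = usersWith(idx, criteria[0]);
            for (int i = 1; i < criteria.length; i++) {
                acc.retainAll(usersWith(idx, criteria[i]));
            }
            return acc;
        }
    }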

The trick? That's exactly what a search index (like Lucene) does, and it does it much better than you could by rolling your own with HBase. That sounds like the tool you want to be using here.

If you must use HBase (as you say, since it's a research project), it might be worth looking into using HBase and Lucene together; google that for pointers.