
I have two kinds of data:

1) Schemaless data (not strictly schemaless, but the columns keep growing over time, and we don't want our load/publish jobs to change whenever the schema changes). This data is currently stored in a key-value store. The number of keys is around 1,000, and the number of pairs is around 700 million.

2) RDBMS tables: a set of tables, each with millions of rows.

I need to create a data store that allows analytics (preferably using SQL) on all of the above data. I looked at some solutions for this problem and felt that the likes of Spark and Apache Drill could solve it. Is this a correct use case for Spark/Shark? What other data stores or solutions could I use here: Cassandra? MongoDB?

Thanks.


1 Answer


As a contributor to Drill, I will answer based on Drill's capabilities:
1. Yes, Drill is well suited to schemaless files; it identifies the file schema on the fly.
2. Drill can already query MongoDB and HBase. RDBMS and Cassandra support is not there yet but is on the roadmap.
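
To make "identifying the schema on the fly" more concrete, here is a minimal Python sketch of the general idea. It is not Drill's implementation and does not use Drill's API. It assumes the key-value data can be read as `(entity_id, key, value)` triples, which is a guess at the layout, and it uses an in-memory SQLite database purely as a stand-in SQL engine. The function names, the `facts` table, and the sample data are all hypothetical.

```python
import sqlite3
from collections import defaultdict


def quote_ident(name):
    """Quote a column/table name for SQLite, escaping embedded double quotes."""
    return '"' + str(name).replace('"', '""') + '"'


def pivot_pairs(pairs):
    """Group (entity_id, key, value) triples into one dict per entity."""
    rows = defaultdict(dict)
    for entity_id, key, value in pairs:
        rows[entity_id][key] = value
    return rows


def load_with_discovered_schema(conn, pairs, table="facts"):
    """Create a table whose columns are whatever keys appear in the data.

    The load code never names a column, so a new key in the input only
    adds a column; the job itself does not change.
    """
    rows = pivot_pairs(pairs)
    # Discover the schema from the data, in a stable (sorted) order.
    columns = sorted({key for row in rows.values() for key in row})
    column_defs = ["entity_id TEXT PRIMARY KEY"] + [quote_ident(c) for c in columns]
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({', '.join(column_defs)})")
    placeholders = ", ".join("?" for _ in column_defs)
    for entity_id, row in rows.items():
        # Keys missing for an entity become NULL.
        values = [entity_id] + [row.get(c) for c in columns]
        conn.execute(f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})", values)
    return columns


# Hypothetical sample data: u1 has no "plan", u2 has no "age".
pairs = [
    ("u1", "country", "IN"),
    ("u1", "age", 31),
    ("u2", "country", "US"),
    ("u2", "plan", "pro"),
]

conn = sqlite3.connect(":memory:")
columns = load_with_discovered_schema(conn, pairs)
result = conn.execute(
    "SELECT country, COUNT(*) FROM facts GROUP BY country ORDER BY country"
).fetchall()
```

At 700 million pairs you would not do this in a single process; the sketch only shows why a schema-on-read engine spares you from editing load jobs when new keys appear.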