3
votes

Right now I have a table in MySQL with 3 columns.

DocId             Int
Match_DocId       Int
Percentage Match  Int
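For illustration, the schema above could be sketched like this (using SQLite in place of MySQL so the snippet is self-contained; the table name `near_duplicates` and the underscored column names are assumptions, not the actual schema):

```python
import sqlite3

# Minimal sketch of the 3-column table described above.
# Table/column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE near_duplicates (
        doc_id           INTEGER NOT NULL,
        match_doc_id     INTEGER NOT NULL,
        percentage_match INTEGER NOT NULL
    )
""")
# One row per (document, near-duplicate) pair, e.g. document 1
# matching document 2 with 93% similarity:
conn.execute("INSERT INTO near_duplicates VALUES (1, 2, 93)")
row = conn.execute("SELECT * FROM near_duplicates").fetchone()
print(row)  # (1, 2, 93)
```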

I am storing a document id along with its near-duplicate document id and a percentage that indicates how closely the two documents match.

So if one document has 100 near duplicates, we have 100 rows for that particular document.

Right now, this table has more than 1 billion records for a total of 14 million documents. I expect the total to grow to 30 million documents, which means the table storing near-duplicate information will have more than 5 billion rows, possibly many more. (The near-duplicate data grows much faster than the document set itself.)

Here are a few issues that I have:

  1. Getting all these records into the MySQL table takes a lot of time.
  2. Queries take a lot of time as well.

Here are a few queries that I run:

  • Check whether a particular document has any near duplicates. (This is relatively fast, but still slow.)

  • For a given set of documents, count how many near duplicates fall into each percentage range (the ranges are 86-90, 91-95, and 96-100).

    This query takes a long time, and most of the time it fails. I am doing a GROUP BY on the percentage column.
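The range-bucket aggregation described above can be sketched with a `CASE` expression inside a `GROUP BY` (SQLite is used here so the example is runnable; the `near_duplicates` table name is an assumption):

```python
import sqlite3

# Sketch of the "near duplicates per percentage range" query,
# assuming a near_duplicates(doc_id, match_doc_id, percentage_match) table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE near_duplicates (doc_id INT, match_doc_id INT, percentage_match INT)"
)
conn.executemany(
    "INSERT INTO near_duplicates VALUES (?, ?, ?)",
    [(1, 2, 88), (1, 3, 93), (1, 4, 97), (1, 5, 98)],
)
rows = conn.execute("""
    SELECT CASE
             WHEN percentage_match BETWEEN 86 AND 90 THEN '86-90'
             WHEN percentage_match BETWEEN 91 AND 95 THEN '91-95'
             WHEN percentage_match BETWEEN 96 AND 100 THEN '96-100'
           END AS bucket,
           COUNT(*)
    FROM near_duplicates
    WHERE doc_id IN (1)
    GROUP BY bucket
    ORDER BY bucket
""").fetchall()
print(rows)  # [('86-90', 1), ('91-95', 1), ('96-100', 2)]
```

On billions of rows this kind of scan-and-group is exactly what gets slow without an index covering `(doc_id, percentage_match)`.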

Can this be managed with any available NoSQL solution?

I am skeptical about the SQL query support in NoSQL solutions, as I need GROUP BY support when querying the data.

2
Why would you use NoSQL for this? Why do you think it's the right tool for the job? If you have only 3 columns in the db, a computer with 8-12 GB of RAM and a decent i5-category CPU running properly configured InnoDB will chew through this like it's nothing. — N.B.
@oleksii I am not sure whether NoSQL will help or not. I am not able to scale MySQL, which is why I am looking at NoSQL options. — ashish
ashish, you probably want to notify @N.B. I agree that vertical scalability has a limit; this is where horizontal scalability kicks in. — oleksii
Sorry for the mix-up. It was targeted at @N.B. — ashish
There is no reason why a NoSQL solution would be "quicker", and it probably won't be. Relational databases were meant for this kind of job, and even though 5 billion rows might seem like a lot of data, the data itself is pretty trivial: 3 columns of integers. — N.B.

2 Answers

2
votes

MySQL

You can try sharding with your current MySQL solution, i.e. splitting your large database into smaller, distinct databases. The catch is that you should only work with one shard at a time, which is fast; queries that span several shards will be painfully slow.
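A minimal sketch of what shard routing could look like (the shard count, naming scheme, and modulo-on-`doc_id` routing are all assumptions for illustration):

```python
# Hash-based shard routing sketch: every row for a given document id
# lives in exactly one shard, so single-document queries touch one
# database. Shard count and naming are illustrative assumptions.
NUM_SHARDS = 4

def shard_for(doc_id: int) -> str:
    """Route a document id to one of NUM_SHARDS databases."""
    return f"near_dup_shard_{doc_id % NUM_SHARDS}"

print(shard_for(14_000_001))  # near_dup_shard_1
print(shard_for(8))           # near_dup_shard_0
```

Note the trade-off this encodes: "does document X have near duplicates?" hits one shard, but a percentage-range aggregate over an arbitrary set of documents has to scatter-gather across all shards.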

NoSql

The Apache Hadoop stack is worth looking at. It includes several systems that support slightly different kinds of queries, and a good point is that they tend to interoperate well with each other.

Check if particular document has any near duplicate. (this is relatively fast, but still slow)

HBase can do this job for a table that big.

Check for given set of documents, how many near duplicates are there in each percentage range ? (Percentage range is 86-90, 91-95 , 96-100)

This should be a good fit for Map-Reduce
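To make the MapReduce fit concrete, here is a toy in-process sketch of the job: the map phase emits a count of 1 per `(doc_id, range)` key, and the reduce phase sums per key. (This is plain Python standing in for Hadoop mappers/reducers; all names and the bucket boundaries come from the question, the rest is illustrative.)

```python
from collections import defaultdict

def map_phase(row):
    """Mapper: emit ((doc_id, percentage_range), 1) per input row."""
    doc_id, _match_id, pct = row
    if 86 <= pct <= 90:
        yield ((doc_id, "86-90"), 1)
    elif 91 <= pct <= 95:
        yield ((doc_id, "91-95"), 1)
    elif 96 <= pct <= 100:
        yield ((doc_id, "96-100"), 1)

def reduce_phase(pairs):
    """Reducer: sum the counts for each (doc_id, range) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

rows = [(1, 2, 88), (1, 3, 97), (1, 4, 99), (2, 5, 92)]
result = reduce_phase(kv for row in rows for kv in map_phase(row))
print(result)  # {(1, '86-90'): 1, (1, '96-100'): 2, (2, '91-95'): 1}
```

In a real Hadoop job the rows would be read from HDFS and the shuffle phase would group keys across machines, but the map/reduce logic is exactly this shape.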


There are many other solutions; see this link for a list and brief descriptions of other NoSQL databases.

1
votes

We have had good experiences with Redis. It's fast and can be made as reliable as you want it to be. Other options could be CouchDB or Cassandra.