Right now I have a table in MySQL with 3 columns:
- DocId Int
- Match_DocId Int
- Percentage Match Int
I am storing each document id along with the id of a near-duplicate document and a percentage that indicates how closely the two documents match.
So if one document has 100 near duplicates, there are 100 rows for that particular document.
Right now, this table has more than 1 billion records for a total of 14 million documents. I am expecting the total number of documents to go up to 30 million. That means the table that stores near-duplicate information will have more than 5 billion rows, maybe more. (Near-duplicate data grows much faster than the total document set.)
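To make the layout concrete, here is a minimal sketch of the table. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the table name `near_duplicates` and the snake_case column names are made up for illustration; the real table is in MySQL and may differ in details.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here; the table and column
# names are illustrative assumptions, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE near_duplicates (
        doc_id           INTEGER NOT NULL,  -- DocId
        match_doc_id     INTEGER NOT NULL,  -- Match_DocId
        percentage_match INTEGER NOT NULL   -- Percentage Match
    )
    """
)

# One row per (document, near duplicate) pair: document 1 has three
# near duplicates, so it occupies three rows.
rows = [(1, 2, 97), (1, 3, 91), (1, 4, 88), (5, 6, 100)]
conn.executemany("INSERT INTO near_duplicates VALUES (?, ?, ?)", rows)

rows_for_doc_1 = conn.execute(
    "SELECT COUNT(*) FROM near_duplicates WHERE doc_id = ?", (1,)
).fetchone()[0]
print(rows_for_doc_1)
```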
Here are a few issues that I have:
- Loading all of these records into the MySQL table takes a lot of time.
- Queries take a lot of time as well.
Here are a few of the queries that I run:
- Check whether a particular document has any near duplicates. (This is faster than the other query, but still slow.)
- For a given set of documents, count how many near duplicates fall in each percentage range (the ranges are 86-90, 91-95 and 96-100).
The second query takes a lot of time, and most of the time it fails. I am doing a GROUP BY on the percentage column.
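Roughly, the two queries have the following shape. This is only a sketch under assumptions: it runs against an in-memory SQLite database through Python's sqlite3 module rather than MySQL, the table name `near_duplicates`, the column names, and the helper functions `has_near_duplicate` and `range_counts` are hypothetical, and the CASE expression is one possible way to express the bucketing, not necessarily the exact query I run.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; all names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE near_duplicates ("
    "doc_id INTEGER, match_doc_id INTEGER, percentage_match INTEGER)"
)
conn.executemany(
    "INSERT INTO near_duplicates VALUES (?, ?, ?)",
    [(1, 2, 97), (1, 3, 91), (1, 4, 88), (1, 7, 99), (5, 6, 100), (8, 9, 86)],
)


def has_near_duplicate(doc_id):
    """Query 1: does this document have at least one near duplicate?"""
    row = conn.execute(
        "SELECT 1 FROM near_duplicates WHERE doc_id = ? LIMIT 1", (doc_id,)
    ).fetchone()
    return row is not None


def range_counts(doc_ids):
    """Query 2: near-duplicate counts per percentage range for a set of docs."""
    if not doc_ids:
        return []
    placeholders = ", ".join("?" for _ in doc_ids)
    sql = f"""
        SELECT CASE
                   WHEN percentage_match BETWEEN 86 AND 90 THEN '86-90'
                   WHEN percentage_match BETWEEN 91 AND 95 THEN '91-95'
                   ELSE '96-100'
               END AS pct_range,
               COUNT(*) AS duplicates
        FROM near_duplicates
        WHERE doc_id IN ({placeholders})
          AND percentage_match BETWEEN 86 AND 100
        GROUP BY pct_range
        ORDER BY pct_range
    """
    return conn.execute(sql, list(doc_ids)).fetchall()


print(has_near_duplicate(1))
print(range_counts([1, 5]))
```

Adding doc_id to the SELECT and GROUP BY lists would give the same counts broken down per document instead of for the whole set.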
Can this be managed with any available NoSQL solution?
I am skeptical about SQL-style query support in NoSQL solutions, as I need GROUP BY support when querying the data.