I need to write a module to detect similar documents. I have read many papers on document fingerprinting techniques and related approaches, but I do not know how to write the code or implement such a solution. The algorithm should work for Chinese, Japanese, English and German, or be language independent. How can I accomplish this?
Bayesian filters have exactly this purpose. That's the technology you'll find in most tools that identify spam.
For example, to detect a language (from http://sebsauvage.net/python/snyppets/#bayesian):
from reverend.thomas import Bayes
guesser = Bayes()
guesser.train('french','La souris est rentrée dans son trou.')
guesser.train('english','my tailor is rich.')
guesser.train('french','Je ne sais pas si je viendrai demain.')
guesser.train('english','I do not plan to update my website soon.')
>>> print guesser.guess('Jumping out of cliffs it not a good idea.')
[('english', 0.99990000000000001), ('french', 9.9999999999988987e-005)]
>>> print guesser.guess('Demain il fera très probablement chaud.')
[('french', 0.99990000000000001), ('english', 9.9999999999988987e-005)]
But it works to detect any category you train it for: technical text, songs, jokes, etc., as long as you can provide enough material to let the tool learn what your documents look like.
If these are pure text documents, or you have a method to extract the text from the documents, you can use a technique called shingling.
You first compute a unique hash for each document. If these are the same, you are done.
If not, you break each document down into smaller chunks. These are your 'shingles.'
Once you have the shingles, you can compute a hash for each shingle and compare the shingle hashes to determine whether the documents are actually the same, or how much they overlap.
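A minimal sketch of that flow, assuming 4-word shingles and MD5 digests (both are arbitrary choices of mine, not part of the description above):

import hashlib

def shingles(text, size=4):
    # Break the document into overlapping word shingles of `size` words.
    words = text.split()
    return {' '.join(words[i:i + size]) for i in range(len(words) - size + 1)}

def shingle_hashes(text, size=4):
    # Hash each shingle so comparisons work on short digests instead of raw text.
    return {hashlib.md5(s.encode('utf-8')).hexdigest() for s in shingles(text, size)}

def resemblance(doc1, doc2, size=4):
    # Whole-document hashes first: if they match, the documents are identical.
    if hashlib.md5(doc1.encode('utf-8')).digest() == hashlib.md5(doc2.encode('utf-8')).digest():
        return 1.0
    a, b = shingle_hashes(doc1, size), shingle_hashes(doc2, size)
    if not a or not b:
        return 0.0
    # Fraction of shingle hashes the two documents share (0.0 .. 1.0).
    return len(a & b) / float(len(a | b))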
The other technique you can use is to generate n-grams of the entire documents, compute the number of similar n-grams shared between them, and produce a weighted score for each document pair. Basically, an n-gram splits text into smaller overlapping chunks of n characters: 'apple' would become ' ap', 'app', 'ppl', 'ple', 'le ' (these are technically character 3-grams). This approach can become quite computationally expensive over a large number of documents or over two very large documents. Of course, common n-grams such as 'the', ' th', 'th ', etc. need to be weighted lower.
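A rough sketch of that n-gram scoring idea; the character 3-grams and the inverse-frequency weighting below are just one way of doing the weighting described above:

from collections import Counter

def char_ngrams(text, n=3):
    # Character n-grams of the space-padded text: ' ap', 'app', 'ppl', 'ple', 'le ' for 'apple'.
    padded = ' ' + text + ' '
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_score(doc1, doc2, n=3):
    # Count n-grams in both documents, sum the shared counts,
    # and down-weight n-grams that are very common across the pair.
    a, b = Counter(char_ngrams(doc1, n)), Counter(char_ngrams(doc2, n))
    total = float(sum(a.values()) + sum(b.values()))
    score = 0.0
    for gram in set(a) & set(b):
        weight = 1.0 - (a[gram] + b[gram]) / total   # frequent n-grams count less
        score += weight * min(a[gram], b[gram])
    # Normalise by the smaller document so scores stay comparable across pairs.
    smaller = min(sum(a.values()), sum(b.values()))
    return score / smaller if smaller else 0.0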
I've posted about this on my blog, and the post links to a few other articles on the subject: Shingling - it's not just for roofers.
Best of luck!
You can use, or at least study, difflib from Python's stdlib to write your code.
It is very flexible, has algorithms to find differences between lists of strings, and can point out those differences. For example, you can use get_close_matches() to find similar words:
>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
It is not the solution but maybe it is a start.
Similarity can be found easily without classification. Try this; it is O(n²) but works fine.
def jaccard_similarity(doc1, doc2):
    a = set(doc1.split())
    b = set(doc2.split())
    # similarity lies in [0, 1]; 1.0 means an exact replica (same word set)
    return float(len(a.intersection(b))) / len(a.union(b))
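For example, two sentences sharing three of their five distinct words score 0.6:

>>> jaccard_similarity('the quick brown fox', 'the quick red fox')
0.6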
You need to make your question more concrete. If you've already read the fingerprinting papers, you already know the principles at work, so describing common approaches here would not be beneficial. If you haven't, you should also check out papers on "duplicate detection" and the various web spam detection papers that have come out of Stanford, Google, Yahoo, and MS in recent years.
Are you having specific problems with coding the described algorithms?
Trouble getting started?
The first thing I'd probably do is separate the tokenization (the process of extracting "words" or other sensible sequences) from the duplicate detection logic, so that it is easy to plug in different parsers for different languages and keep the duplicate detection piece the same.
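A sketch of that separation, assuming a whitespace/regex tokenizer for English and German and a crude per-character fallback for Chinese and Japanese (both tokenizers are stand-ins; a real CJK word segmenter would plug in the same way):

import re

def tokenize_whitespace(text):
    # Reasonable for English and German, where words are space-delimited.
    return re.findall(r'\w+', text.lower(), re.UNICODE)

def tokenize_per_char(text):
    # Crude stand-in for Chinese/Japanese; a proper word segmenter would replace this.
    return [ch for ch in text if not ch.isspace()]

def duplicate_score(doc1, doc2, tokenize):
    # The language-independent part: compare token sets, however they were produced.
    a, b = set(tokenize(doc1)), set(tokenize(doc2))
    if not a or not b:
        return 0.0
    return len(a & b) / float(len(a | b))

# The same detection code runs with whichever tokenizer fits the language:
#   duplicate_score(doc_en1, doc_en2, tokenize_whitespace)
#   duplicate_score(doc_zh1, doc_zh2, tokenize_per_char)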
There is a rather good talk on neural networks on Google TechTalks that discusses using layered Boltzmann machines to generate feature vectors for documents, which can then be used to measure document distance. The main issue is the requirement for a large sample document set to train the network to discover relevant features.
If you are trying to detect documents that talk about the same topic, you could try collecting the most frequently used words and throwing away the stop words. Documents that have a similar distribution of the most frequently used words are probably talking about similar things. You may need to do some stemming, and extend the concept to n-grams, if you want higher accuracy. For more advanced techniques, look into machine learning.
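A minimal sketch of that idea, assuming a tiny hand-rolled stop-word list and cosine similarity over raw term frequencies (both are my choices here, not prescribed above):

import math
from collections import Counter

STOP_WORDS = {'the', 'a', 'an', 'and', 'or', 'is', 'are', 'of', 'to', 'in'}  # illustrative only

def term_frequencies(text):
    # Frequency of the non-stop-word tokens in the document.
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(tokens)

def topic_similarity(doc1, doc2):
    # Cosine similarity of the two frequency distributions (1.0 = identical distribution).
    a, b = term_frequencies(doc1), term_frequencies(doc2)
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0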
You might want to look into the DustBuster algorithm as outlined in this paper.
From the paper, they're able to detect duplicate pages without even examining the page contents. Of course examining the contents increases the efficacy, but raw server logs alone are adequate for the method to detect duplicate pages.
Similar to the recommendation of using MD5 or SHA1 hashes, the DustBuster method largely relies on comparing file size as its primary signal. Simple as it sounds, it is rather effective as an initial first pass.
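An illustration of that kind of first pass (this is not DustBuster itself, just the "group by recorded size before looking at content" idea; the log-entry format here is made up):

from collections import defaultdict

def size_based_candidates(log_entries):
    # log_entries: iterable of (url, response_size_in_bytes) taken from server logs.
    # URLs whose recorded sizes match are grouped as duplicate *candidates*,
    # to be confirmed later by actually comparing content.
    buckets = defaultdict(set)
    for url, size in log_entries:
        buckets[size].add(url)
    return [urls for urls in buckets.values() if len(urls) > 1]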