I'm trying to use TF-IDF as a relative-frequency weight to calculate cosine distance. I've selected 10 words from one document (say, File 1) and picked another 10 files from my folder; I want to use those 10 words and their frequencies to check which of the 10 files are similar to File 1. Say the total number of files in the folder is 46. I know that DF is the number of documents the word appears in, IDF is log(total number of files / DF), which here is log(46 / DF), and TF-IDF is the product of TF (the frequency of the word in one document) and IDF.
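To make the definitions above concrete, here is a minimal Python sketch of how I'm computing the weights. The words, counts, and document frequencies are made up for illustration (they are not from my actual files), the function name `tf_idf_vector` is my own, and I'm assuming the natural logarithm since I haven't fixed a base.

```python
import math

def tf_idf_vector(term_counts, doc_freq, total_docs):
    """Return one TF-IDF weight per word, in the order the words appear in term_counts."""
    weights = []
    for word, tf in term_counts.items():
        df = doc_freq[word]               # number of documents containing the word
        idf = math.log(total_docs / df)   # IDF = log(total files / DF), natural log assumed
        weights.append(tf * idf)          # TF-IDF = TF * IDF
    return weights

# Hypothetical numbers: 46 files in the folder, three of the selected words shown.
total_docs = 46
doc_freq = {"apple": 23, "river": 46, "engine": 2}
file2_counts = {"apple": 3, "river": 5, "engine": 1}

file2_vector = tf_idf_vector(file2_counts, doc_freq, total_docs)
print(file2_vector)
```

Note that a word appearing in all 46 files gets IDF = log(1) = 0, so it contributes nothing no matter how often it occurs in the file.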
QUESTION:
Assuming what I said above is 100% correct: after getting the TF-IDF for all 10 words in one document (say, File 2), do I add the TF-IDF values of the 10 words together to get the TF-IDF for File 2?
What is the cosine distance?
Could anyone help with an example?
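To show the kind of example I'm after, here is a sketch of the comparison as I currently understand it; please correct it if it's wrong. It assumes each file keeps its per-word TF-IDF values as a vector (one entry per selected word, in the same word order, rather than a single sum) and uses the standard cosine similarity definition, the dot product divided by the product of the vector lengths, with cosine distance taken as 1 minus that. The vectors and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    """One common convention: distance = 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)

# Hypothetical TF-IDF vectors over the same three words, in the same order.
file1_vector = [2.0, 0.0, 3.0]
file2_vector = [4.0, 0.0, 6.0]   # same direction as File 1, just larger counts
file3_vector = [0.0, 5.0, 0.0]   # shares no weighted words with File 1

print(cosine_similarity(file1_vector, file2_vector))
print(cosine_similarity(file1_vector, file3_vector))
```

If this is right, File 2 would rank as more similar to File 1 than File 3 does, because only the direction of the vectors matters, not their length.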