1) Calculate tf-idf (generally better than tf alone, but it depends entirely on your data set and requirements).
From Wikipedia (regarding idf):
An inverse document frequency factor is incorporated which diminishes
the weight of terms that occur very frequently in the document set and
increases the weight of terms that occur rarely.
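To make the idf idea concrete, here is a minimal sketch computing plain idf, log(N / df), on a toy three-document corpus (the corpus and the helper name `idf` are made up for illustration):

```python
import math

# Toy corpus: "the" appears in every document, "galaxy" in only one.
docs = [["the", "cat"], ["the", "dog"], ["the", "galaxy"]]
n_docs = len(docs)

def idf(term):
    # Plain idf: log(N / df), where df = number of docs containing the term.
    df = sum(term in doc for doc in docs)
    return math.log(n_docs / df)

# "the" occurs in all 3 docs, so idf("the") = log(3/3) = 0 — fully down-weighted.
# "galaxy" occurs in 1 doc, so idf("galaxy") = log(3/1) ≈ 1.10 — boosted.
```

A term that appears in every document gets idf 0, so its tf-idf weight vanishes no matter how often it occurs; rare terms keep a large weight. (Libraries such as scikit-learn use a smoothed variant of this formula, so exact numbers differ.)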
2) No, it is not important that both documents have the same number of words.
3) You can compute tf-idf or cosine similarity in almost any language these days by calling a machine learning library function. I prefer Python.
Python code to calculate tf-idf and cosine similarity (using scikit-learn 0.18.2):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.datasets import fetch_20newsgroups

# Example corpus: the 20 newsgroups data set bundled with scikit-learn.
example_data = fetch_20newsgroups(subset='all').data

max_features_for_tfidf = 10000
is_idf = True

# max_df=0.5 drops terms that appear in more than half the documents;
# min_df=2 drops terms that appear in fewer than two documents.
vectorizer = TfidfVectorizer(max_df=0.5, max_features=max_features_for_tfidf,
                             min_df=2, stop_words='english',
                             use_idf=is_idf)

# Sparse (n_documents, n_features) tf-idf matrix.
X_Mat = vectorizer.fit_transform(example_data)

# Pairwise cosine similarity between every pair of documents.
cosine_sim = cosine_similarity(X=X_Mat, Y=X_Mat)
4) You might also be interested in truncated Singular Value Decomposition (SVD), which reduces the high-dimensional tf-idf matrix to a small dense representation (this is the idea behind latent semantic analysis).
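A minimal sketch of that idea, applying scikit-learn's `TruncatedSVD` to a tf-idf matrix before computing cosine similarity (the toy corpus and the choice of 2 components are assumptions for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus, chosen only to keep the example self-contained.
docs = [
    "the cat sat on the mat",
    "a cat lay on the rug",
    "stock markets fell sharply today",
]
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse (3, n_features)

# Keep only the top components; n_components must be < n_features.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)              # dense (3, 2) matrix

# Cosine similarity in the reduced (latent) space.
sim = cosine_similarity(reduced)
```

Working in the reduced space is much cheaper for large vocabularies and can group documents that share related but not identical terms.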