
I've calculated the TF of my dataset and I'm currently trying to calculate the IDF for it. I'm confused about which number to use for the calculation.

id       uid
1         a
1         b
1         c
1         d
2         a
2         b
2         c
2         e
3         b
3         c 
3         e
3         f
(3 items)

Occurrence
a = 2
b = 3
c = 3
d = 1
e = 2
f = 1

Counting the items in which each pair of terms appears together gives something like this:

  A B C
A - 2 2
B 2 - 3
C 2 3 -

Formula

IDF(t, D) = log(total number of documents / number of documents containing term t)

For example, for the pair (A,B), whose value is 2, how should I go about calculating it?
Total items = 3
Number of documents matching the term = should I be using the value for A or for B (2 or 3)?

(A,B) * log(total / matching)
= 2 * log(3 / (2 or 3)) ?

1 Answer


I am not sure what you meant by (A,B).

But I assume that in your dataset the first column is the document id and the second column is the term.

If my assumption is correct, then:

doc id 1 is "a b c d"
doc id 2 is "a b c e"
doc id 3 is "b c e f"
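
As a minimal sketch of that assumption, the (id, uid) rows from the question can be grouped into one set of terms per document. The names `rows` and `docs` are only illustrative; they do not come from the question.

```python
from collections import defaultdict

# (document id, term) pairs copied from the question's table
rows = [
    (1, "a"), (1, "b"), (1, "c"), (1, "d"),
    (2, "a"), (2, "b"), (2, "c"), (2, "e"),
    (3, "b"), (3, "c"), (3, "e"), (3, "f"),
]

# Group the terms by document id (assumes column 1 is the document, column 2 the term)
docs = defaultdict(set)
for doc_id, term in rows:
    docs[doc_id].add(term)

for doc_id in sorted(docs):
    print(doc_id, " ".join(sorted(docs[doc_id])))
```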

Your formula for IDF(t, D) is log(# of documents / # of documents that contain the term). Thus, we can calculate the IDF for each term as follows:

IDF('a', D) = log(3 / 2)
IDF('b', D) = log(3 / 3)
and so on...
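
The same calculation for every term could look roughly like this. It is only a sketch: it assumes the natural logarithm (the formula in the question does not say which base to use), and `idf` is a hypothetical helper name.

```python
import math

# Documents as assumed above: one set of terms per document id
docs = {
    1: {"a", "b", "c", "d"},
    2: {"a", "b", "c", "e"},
    3: {"b", "c", "e", "f"},
}

def idf(term, docs):
    """log(number of documents / number of documents containing term).

    Assumes the term occurs in at least one document.
    """
    matching = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / matching)

for term in "abcdef":
    print(term, round(idf(term, docs), 3))
```

The number of matching documents for each term is the same as the "Occurrence" count in the question, so IDF is computed per term rather than per pair. A term that appears in every document, such as 'b' or 'c', gets an IDF of 0.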

Here is my reference: https://en.wikipedia.org/wiki/Tf%E2%80%93idf