
I have a million files that contain free text. Each file has been assigned one or more codes, which can be treated as categories. I have normalized the text by removing stop words. I am using scikit-learn with libsvm to train a model that predicts the right code(s) (categories) for each file.

I have read and searched a lot, but I couldn't understand how to represent my textual data as numbers, since SVMs and most machine learning tools learn from numerical values.

I think I would need to compute tf-idf for each term in the whole corpus, but I am still not sure how that would help me convert my textual data into libsvm format.

Any help would be greatly appreciated. Thank you.


1 Answer


You are not forced to use tf-idf.

To begin with, follow this simple approach:

  1. Collect all distinct words across all your documents. This will be your vocabulary. Save it in a file.
  2. For each word in a given document, replace it with the index of the word in your vocabulary file.
  3. Attach to each index the number of times the word appears in the document.

Example:

I have two documents (stop words removed, stemmed):

hello world

and

hello sky sunny hello

Step 1: I generate the following vocabulary

hello
sky
sunny
world

Step 2:

I can represent my documents like this:

1 4

(because the word hello is in position 1 in the vocabulary and the word world is in position 4) and

1 2 3 1


Step 3: I add the term frequency next to each index and remove duplicates:

1:1 4:1

(because the word hello appears 1 time in the document, and the word world appears 1 time)

and

1:2 2:1 3:1
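
Steps 1 to 3 could be sketched in Python roughly as follows. This is only an illustration for the two example documents: the function names `build_vocabulary` and `to_term_counts` are made up here, and it assumes the text is already cleaned and can be split on whitespace. For a million files you would stream the documents from disk rather than hold them in a list.

```python
from collections import Counter

def build_vocabulary(documents):
    """Step 1: map each distinct word to a 1-based index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words, start=1)}

def to_term_counts(document, vocabulary):
    """Steps 2 and 3: replace words by their index and count occurrences."""
    return dict(Counter(vocabulary[word] for word in document.split()))

documents = ["hello world", "hello sky sunny hello"]
vocabulary = build_vocabulary(documents)
counts = [to_term_counts(doc, vocabulary) for doc in documents]

print(vocabulary)  # {'hello': 1, 'sky': 2, 'sunny': 3, 'world': 4}
print(counts)      # [{1: 1, 4: 1}, {1: 2, 2: 1, 3: 1}]
```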


If you add the class number in front of each line, you have a file in libsvm format:

1 1:1 4:1
2,3 1:2 2:1 3:1 

Here the first document has class 1, and the second document has classes 2 and 3.
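
A minimal sketch of writing such lines, assuming each document is already available as a list of class labels plus a dictionary of `index: value` pairs (the helper name `to_libsvm_line` is hypothetical):

```python
def to_libsvm_line(labels, features):
    """Format one document as '<label>[,<label>...] <index>:<value> ...'."""
    label_part = ",".join(str(label) for label in labels)
    # Feature indices are written in ascending order.
    feature_part = " ".join(
        f"{index}:{value}" for index, value in sorted(features.items())
    )
    return f"{label_part} {feature_part}"

rows = [
    ([1], {1: 1, 4: 1}),           # document 1, class 1
    ([2, 3], {1: 2, 2: 1, 3: 1}),  # document 2, classes 2 and 3
]
lines = [to_libsvm_line(labels, features) for labels, features in rows]
print("\n".join(lines))
# 1 1:1 4:1
# 2,3 1:2 2:1 3:1
```

If you read the file back with scikit-learn, `sklearn.datasets.load_svmlight_file` has a `multilabel` parameter that should be set to `True` for lines with comma-separated labels like the second one.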

In this example, each word is associated with its term frequency. To use tf-idf, do the same but replace the term frequency with the computed tf-idf value.
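
As a rough illustration of that replacement, the sketch below turns the term counts of the two example documents into tf-idf weights. The function name `tfidf_weights` is hypothetical, and the smoothed idf formula `ln((1 + N) / (1 + df)) + 1` is an assumption: it is one common variant among several definitions of tf-idf, and no length normalization is applied here.

```python
import math

def tfidf_weights(count_rows):
    """Replace each raw count with tf * idf, using a smoothed idf (assumed variant)."""
    n_docs = len(count_rows)
    # Document frequency: in how many documents each index occurs.
    doc_freq = {}
    for counts in count_rows:
        for index in counts:
            doc_freq[index] = doc_freq.get(index, 0) + 1
    weighted = []
    for counts in count_rows:
        weighted.append({
            index: tf * (math.log((1 + n_docs) / (1 + doc_freq[index])) + 1)
            for index, tf in counts.items()
        })
    return weighted

count_rows = [{1: 1, 4: 1}, {1: 2, 2: 1, 3: 1}]
weighted = tfidf_weights(count_rows)
for row in weighted:
    print(" ".join(f"{index}:{value:.4f}" for index, value in sorted(row.items())))
```

With this variant, a word that occurs in every document (here `hello`, index 1) keeps a weight equal to its raw count, while rarer words are weighted up.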