I have a million files that contain free text. Each file has been assigned one or more codes, which can be treated as categories. I have normalized the text by removing stop words. I am using scikit-learn (libsvm) to train a model that predicts the right code(s) (categories) for each file.
I have read and searched a lot, but I couldn't work out how to represent my textual data as numbers, since SVMs and most machine learning tools learn from numerical values.
I think I need to compute tf-idf for each term across the whole corpus, but I am still not sure how that would help me convert my textual data into libsvm format.
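
To make the tf-idf idea concrete, here is a minimal sketch of how text might be turned into numbers and written out in libsvm format with scikit-learn. The toy documents, the code names ("medical", "billing"), and the choice of `LinearSVC` wrapped in `OneVsRestClassifier` are all assumptions for illustration. `LinearSVC` is backed by liblinear rather than libsvm; `sklearn.svm.SVC` is the libsvm-backed class. Note that scikit-learn estimators accept the sparse tf-idf matrix directly, so the libsvm file is probably only needed if the data is fed to the standalone libsvm tools.

```python
from io import BytesIO

from sklearn.datasets import dump_svmlight_file
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the real files and their codes.
docs = [
    "patient fever cough treated antibiotics",
    "invoice payment overdue account balance",
    "patient invoice hospital payment",
    "cough fever flu symptoms",
]
codes = [["medical"], ["billing"], ["medical", "billing"], ["medical"]]

# Each document becomes one sparse row; each distinct term becomes one
# column holding that term's tf-idf weight in the document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Each file can carry several codes, so encode them as a 0/1 indicator
# matrix with one column per code.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(codes)

# One binary linear SVM per code (an assumed setup, not the only option).
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)
predicted = clf.predict(X)

# Write the same data in libsvm/svmlight format: "labels index:value ...".
buffer = BytesIO()
dump_svmlight_file(X, Y, buffer, multilabel=True)
libsvm_text = buffer.getvalue().decode("utf-8")
```

Passing a file path instead of `buffer` to `dump_svmlight_file` would write the file to disk, and `binarizer.inverse_transform(predicted)` maps the 0/1 predictions back to code names.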
Any help would be greatly appreciated. Thank you.