3
votes

I have an NLP task which is basically supervised text classification. I tagged a corpus with its POS tags, then I use the different vectorizers that scikit-learn provides in order to feed some classification algorithm that scikit-learn provides as well. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.

First I POS-tagged the corpus, then I obtained some different bigrams, which have the following structure:

bigram = [[('word','word'),...,('word','word')]]

Apparently I have everything I need to classify (I already classified some small examples, but not the whole corpus).

I would like to use the bigrams as features to present to a classification algorithm (multinomial Naive Bayes, SVM, etc.).

What could be a standard (Pythonic) way to arrange all the text data for classification and to show the results for the classified corpus? I was thinking about using ARFF files and NumPy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.


3 Answers

1
votes

The easiest option is load_files, which expects a directory layout

data/
    positive/     # class label
        1.txt     # arbitrary filename
        2.txt
        ...
    negative/
        1.txt
        2.txt
        ...
    ...

(This isn't really a standard, it's just convenient and customary. Some ML datasets on the web are offered in this format.)

The output of load_files is a Bunch (a dict-like object) with the documents in .data, the integer labels in .target, and the class names (taken from the folder names) in .target_names.
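A minimal, self-contained sketch of that layout: it writes a throwaway positive/negative tree into a temporary directory and loads it back with load_files. In practice you would point load_files at your own corpus directory; the file contents here are placeholders.

```python
# Build a tiny example of the directory layout shown above, then load it.
import os
import tempfile

from sklearn.datasets import load_files

root = tempfile.mkdtemp()
for label, text in [("positive", "good film"), ("negative", "bad film")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "1.txt"), "w") as f:
        f.write(text)

dataset = load_files(root, encoding="utf-8")
# dataset.data   -> list of documents (one per file)
# dataset.target -> integer label per document
# dataset.target_names -> class names, taken from the folder names
```

Since dataset.target is aligned with dataset.data, the pair can be fed straight into a vectorizer and classifier.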

2
votes

Your question is very vague. There are books and courses on the subject you can access. Have a look at this blog for a start 1 and these courses 2 and 3.

1
votes

1) larsmans has already mentioned a convenient way to arrange and store your data. 2) When using scikit-learn, NumPy arrays always make life easier, as they have many features for rearranging your data easily. 3) Training data and testing data are labeled in the same way, so you would usually have something like:

bigramFeatureVector = [(featureVector0, label), (featureVector1, label),..., (featureVectorN, label)]

The proportion of training data to testing data depends heavily on the size of your data. You should indeed learn about n-fold cross-validation, because it will resolve all your doubts, and you will most probably have to use it for more accurate evaluations. To explain it briefly: for 10-fold cross-validation, say you have an array in which all your data along with labels are held (something like my example above). Then, in a loop running ten times, you leave out one tenth of the data for testing and use the rest for training. Once you learn this, you will have no confusion about how training or testing data should look: they both look exactly the same. 4) How to visualize your classification results depends on which evaluation measures you would like to use. That is unclear from your question, but let me know if you have further questions.
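The loop described above can be sketched with scikit-learn's KFold; the count vectors and labels here are randomly generated placeholders, and MultinomialNB stands in for whichever classifier you choose.

```python
# 10-fold cross-validation: each pass holds out one tenth for testing.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(50, 8))   # 50 fake count-feature vectors
y = rng.randint(0, 2, size=50)        # 50 fake binary labels

scores = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
# scores now holds ten held-out accuracy estimates; average them
# for a single, more reliable evaluation number.
```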