3
votes

I have an NLP task which is basically supervised text classification. I tagged a corpus with its POS tags, then I use the different vectorizers that scikit-learn provides in order to feed some classification algorithm that scikit-learn provides as well. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.

First I POS-tagged the corpus, then I obtained some different bigrams, which have the following structure:

bigram = [[('word','word'),...,('word','word')]]

Apparently I have everything I need to classify (I already classified some small examples, but not the whole corpus).

I would like to use the bigrams as features to present to a classification algorithm (multinomial Naive Bayes, SVM, etc.).

What could be a standard (Pythonic) way to arrange all the text data for classification and to show the results for the classified corpus? I was thinking about using ARFF files and NumPy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.


3 Answers

1
votes

The easiest option is load_files, which expects a directory layout

data/
    positive/     # class label
        1.txt     # arbitrary filename
        2.txt
        ...
    negative/
        1.txt
        2.txt
        ...
    ...

(This isn't really a standard, it's just convenient and customary. Some ML datasets on the web are offered in this format.)

The output of load_files is a Bunch (a dict-like object) with the documents in .data, the integer labels in .target, and the class names (taken from the folder names) in .target_names.
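A minimal, self-contained sketch of that layout: it writes a throwaway positive/negative tree into a temporary directory and loads it back with load_files. In practice you would point load_files at your own corpus directory; the file contents here are placeholders.

```python
# Build a tiny example of the directory layout shown above, then load it.
import os
import tempfile

from sklearn.datasets import load_files

root = tempfile.mkdtemp()
for label, text in [("positive", "good film"), ("negative", "bad film")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "1.txt"), "w") as f:
        f.write(text)

dataset = load_files(root, encoding="utf-8")
# dataset.data   -> list of documents (one per file)
# dataset.target -> integer label per document
# dataset.target_names -> class names, taken from the folder names
```

Since dataset.target is aligned with dataset.data, the pair can be fed straight into a vectorizer and classifier.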

2
votes

Your question is very vague. There are books and courses on the subject you can access. Have a look at this blog for a start 1 and these courses 2 and 3.

1
votes

1) larsmans has already mentioned a convenient way to arrange and store your data. 2) When using scikit-learn, NumPy arrays always make life easier, as they have many features for rearranging your data easily. 3) Training data and testing data are labeled in the same way, so you would usually have something like:

bigramFeatureVector = [(featureVector0, label), (featureVector1, label),..., (featureVectorN, label)]

The proportion of training data to testing data depends heavily on the size of your data. You should indeed learn about n-fold cross-validation, because it will resolve all your doubts, and you will most probably have to use it for more accurate evaluations. To explain it briefly: for 10-fold cross-validation, say you have an array in which all your data along with labels are held (something like my example above). Then, in a loop running ten times, you leave out one tenth of the data for testing and use the rest for training. Once you learn this, you will have no confusion about how training or testing data should look: they both look exactly the same. 4) How to visualize your classification results depends on which evaluation measures you would like to use. That is unclear from your question, but let me know if you have further questions.
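The loop described above can be sketched with scikit-learn's KFold; the count vectors and labels here are randomly generated placeholders, and MultinomialNB stands in for whichever classifier you choose.

```python
# 10-fold cross-validation: each pass holds out one tenth for testing.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)
X = rng.randint(0, 5, size=(50, 8))   # 50 fake count-feature vectors
y = rng.randint(0, 2, size=50)        # 50 fake binary labels

scores = []
for train_idx, test_idx in KFold(n_splits=10).split(X):
    clf = MultinomialNB().fit(X[train_idx], y[train_idx])
    scores.append(clf.score(X[test_idx], y[test_idx]))
# scores now holds ten held-out accuracy estimates; average them
# for a single, more reliable evaluation number.
```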