I have an NLP task that is basically supervised text classification. I tagged a corpus with its POS tags, then I use the different vectorizers that scikit-learn provides in order to feed a classification algorithm that scikit-learn provides as well. I also have the labels (categories) of the corpus, which I previously obtained in an unsupervised way.
First I POS-tagged the corpus, then I extracted different bigrams with the following structure:
bigram = [[('word','word'),...,('word','word')]]
It seems I have everything I need to classify (I have already classified with some small examples, but not with the whole corpus).
I would like to use the bigrams as features to present to a classification algorithm (multinomial Naive Bayes, SVM, etc.).
What would be a standard (Pythonic) way to arrange all the text data for classification and to show the results for the classified corpus? I was thinking about using ARFF files and NumPy arrays, but I guess that could complicate the task unnecessarily. On the other hand, I was thinking about splitting the data into train and test folders, but I don't see how to set up the labels in the train folder.
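One alternative I am considering, instead of folders: keep the documents and labels as parallel arrays and let `train_test_split` shuffle them together, so the labels stay aligned without any folder layout (the `X` and `y` below are placeholders for the vectorized corpus and its categories):

```python
from sklearn.model_selection import train_test_split

# Placeholder data: X would be the vectorized corpus, y the category labels
X = [[0, 1], [1, 0], [1, 1], [0, 0]]
y = [0, 1, 1, 0]

# Shuffles X and y with the same permutation, so each row keeps its label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))
```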