12
votes

I'm trying to use scikit-learn to do some machine learning on natural language data. I've got my corpus transformed into bag-of-words vectors (which take the form of a sparse CSR matrix) and I'm wondering if there's a supervised dimensionality reduction algorithm in sklearn capable of taking high-dimensional, supervised data and projecting it into a lower dimensional space which preserves the variance between these classes.

The high-level problem description is that I have a collection of documents, each of which can have multiple labels on it, and I want to predict which of those labels will get slapped on a new document based on the content of the document.

At its core, this is a supervised, multi-label, multi-class problem using a sparse representation of BoW vectors. Is there a dimensionality reduction technique in sklearn that can handle that sort of data? Are there other sorts of techniques people have used in working with supervised BoW data in scikit-learn?
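For concreteness, this is roughly what my setup looks like (variable names and contents are just placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer

    docs = ["text of the first document ...", "text of the second document ..."]
    labels = [["python", "nlp"], ["nlp"]]            # multiple labels per document

    X = CountVectorizer().fit_transform(docs)        # sparse CSR bag-of-words matrix
    Y = MultiLabelBinarizer().fit_transform(labels)  # binary indicator matrix of labels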

Thanks!

5
A nice way to do dimensionality reduction is with an autoencoder; I'm not sure if scikit-learn has one, though. An autoencoder is just a neural net where the output is an attempted reconstruction of the input, and the hidden layer (typically) has lower dimensionality than the input. That way the input is forced through a lower-dimensional representation which must be at least good enough to reconstruct the input from, so the hidden layer can be used as a compressed representation of the input. See: cs.toronto.edu/~hinton/science.pdf. Also, Theano and pylearn2 are great for neural nets. – Ethan
There is someone working on an autoencoder, but it's not finished yet. – Fred Foo

5 Answers

2
votes

I am a bit confused by your question. In my experience, dimensionality reduction is never really supervised... but it seems that what you want is some sort of informed feature selection, which is impossible to do before the classification is done. In other words, you cannot know which features are more informative before your classifier is trained and validated.

However, reducing the size and complexity of your data is always good, and you have various ways to do it with text data. The applicability and performance depend on the type of vectors you have (frequency counts, tfidf), and you will always have to determine the number of dimensions (components) you want in your output. The implementations in scikit-learn are mostly in the decomposition module.

The most popular method in Natural Language Processing is Singular Value Decomposition (SVD), which is at the core of Latent Semantic Analysis (LSA, also LSI). Staying with scikit-learn, you can simply apply TruncatedSVD() on your data. A similar method is Non-negative matrix factorization, implemented in scikit-learn as NMF().
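For instance, something along these lines (the number of components is just a placeholder value):

    from sklearn.decomposition import TruncatedSVD, NMF

    # X is the sparse CSR bag-of-words / tfidf matrix
    svd = TruncatedSVD(n_components=300)
    X_lsa = svd.fit_transform(X)        # this is LSA when applied to tfidf vectors

    nmf = NMF(n_components=300)         # NMF needs non-negative input, which BoW/tfidf is
    X_nmf = nmf.fit_transform(X)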

An increasingly popular approach uses transformation by random projections (Random Indexing). You can do this in scikit-learn with the functions in the random_projection module.
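A minimal sketch (n_components here is just an example value):

    from sklearn.random_projection import SparseRandomProjection

    # works directly on the sparse CSR matrix, no densification needed
    srp = SparseRandomProjection(n_components=1000, random_state=0)
    X_rp = srp.fit_transform(X)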

As someone pointed out in another answer, Latent Dirichlet Allocation is also an alternative, although it is much slower and computationally more demanding than the methods above. Besides, it is at the time of writing unavailable in scikit-learn.

If all you want is to simplify your data in order to feed it to a classifier, I would suggest SVD with n_components between 100 and 500, or random projection with n_components between 500 and 2000 (common values from the literature).

If you are interested in using the reduced dimensions as some sort of classification/clustering already (people call this topic extraction, although you are really not extracting topics, rather latent dimensions), then LDA might be a better option. Beware: it is slow and it only takes pure frequency counts (no tfidf). And the number of components is a parameter that you have to determine in advance (no estimation possible).

Returning to your problem, I would make a scikit-learn pipeline with a vectorizer, dimensionality reduction options, and a classifier, and would carry out a massive parameter search. In this way, you will see what gives you the best results with the label set you have.
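A rough sketch of such a pipeline (class choices and parameter values are just illustrative, and module paths may differ across scikit-learn versions):

    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.linear_model import LogisticRegression

    pipe = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("svd", TruncatedSVD()),
        ("clf", OneVsRestClassifier(LogisticRegression())),
    ])

    param_grid = {
        "svd__n_components": [100, 300, 500],
        "clf__estimator__C": [0.1, 1.0, 10.0],
    }

    search = GridSearchCV(pipe, param_grid, cv=3)
    search.fit(docs, Y)     # docs: raw texts, Y: binary label indicator matrix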

1
votes

You can use latent dirichlet allocation (here's the wiki) to discover the topics in your documents. For the assignment of a label to a document, you can use the conditional probability distribution for a document label (given the distribution over the topics in your document). If you have labels for your documents already, then you just need to learn the CPD, which is trivial. Unfortunately, scikit-learn does not have an LDA implementation, but gensim does.
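A minimal gensim sketch, assuming your documents are already tokenized (the number of topics is just a placeholder):

    from gensim import corpora, models

    # tokenized_docs: a list of token lists, e.g. [["the", "cat", ...], ...]
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_docs]

    lda = models.LdaModel(corpus, num_topics=50, id2word=dictionary)  # 50 topics is a guess
    doc_topics = lda[corpus[0]]    # topic distribution for the first document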

PS: Here's another paper that may help. If you're not very well versed in statistical inference/learning or machine learning, I suggest that you start here (note: it still assumes a high level of mathematical maturity).

1
votes

Several existing scikit modules do something similar to what you asked for.

  • Linear Discriminant Analysis is probably closest to what you asked for. It finds a projection of the data that maximizes the distance between the class centroids relative to the projected variances (there is a rough sketch after this list).

  • Cross decomposition includes methods like Partial Least Squares which fit linear regression models for multidimensional targets via a projection through a lower-dimensional intermediate space. It is a lot like a single hidden layer neural net without the sigmoids. These are linear regression methods, but you could apply a 0-1 encoding to your target signal and use these models anyway.

  • You could use an L1 regularized classifier like LogisticRegression or SGDClassifier to do feature selection. RandomizedLogisticRegression combines this with bootstrapping to get a more stable feature set.
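A rough sketch of the first and third options. I'm assuming a single-label target here (called y_single_label for illustration); also note that LinearDiscriminantAnalysis needs dense input and its module path differs between scikit-learn versions:

    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    # Linear Discriminant Analysis: supervised projection, but it expects a dense
    # array and one label per sample, so densify (or pre-reduce with TruncatedSVD).
    lda = LinearDiscriminantAnalysis()          # n_components must be < number of classes
    X_lda = lda.fit_transform(X.toarray(), y_single_label)

    # L1-regularized feature selection: keep only features with non-zero weights.
    l1_clf = LogisticRegression(penalty="l1", solver="liblinear")
    X_selected = SelectFromModel(l1_clf).fit_transform(X, y_single_label)  # stays sparse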

0
votes

Try ISOMAP. There's a super simple built-in function for it in scikits.learn. Even if it doesn't have some of the preservation properties you're looking for, it's worth a try.
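A minimal sketch, assuming X is your BoW matrix (Isomap will likely need dense input and scales poorly with many samples, so densify or pre-reduce with TruncatedSVD first):

    from sklearn.manifold import Isomap

    iso = Isomap(n_neighbors=5, n_components=50)   # values are just examples
    X_iso = iso.fit_transform(X.toarray())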

0
votes

Use a multi-layer neural net for classification. If you want to see what the representation of the input is in the reduced dimension, look at the activations of the hidden layer. The hidden-layer representation is by definition optimised to distinguish between the classes, since that is what is directly optimised when the weights are trained.

You should remember to use a softmax activation on the output layer, and something non-linear on the hidden layer (tanh or sigmoid).
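A minimal sketch with scikit-learn's MLPClassifier (added in version 0.18, so it may not exist in older installs); the hidden-layer activations are recomputed by hand from the learned weights:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # MLPClassifier applies softmax on the output layer for multi-class problems;
    # "tanh" matches the suggestion above for the hidden layer.
    clf = MLPClassifier(hidden_layer_sizes=(100,), activation="tanh", max_iter=500)
    clf.fit(X, y)      # X: BoW matrix (sparse is fine), y: class labels

    # Recompute the hidden-layer activations and use them as the reduced representation.
    hidden = np.tanh(X @ clf.coefs_[0] + clf.intercepts_[0])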