1
votes

I have a text classification task. By now i only tagged a corpus and extracted some features in a bigram format (i.e bigram = [('word', 'word'),...,('word', 'word')]. I would like to classify some text, as i understand SVM algorithm only can receive vectors in orther to classify, so i use some vectorizer in scikit as follows:

bigram = [ [('load', 'superior')
             ('point', 'medium'), ('color', 'white'),
             ('the load', 'tower')]]

fh = FeatureHasher(input_type='string')

X = fh.transform(((' '.join(x) for x in sample)
                  for sample in bigram))
print X

the output is a sparse matrix:

  (0, 226456)   -1.0
  (0, 607603)   -1.0
  (0, 668514)   1.0
  (0, 715910)   -1.0

How can i use the previous sparse matrix X to classify with SVC?, assuming that i have 2 classes and a train and test sets.

1
Every document should be a sparse vector in your matrix... libSVM expects your data to be sparse vectors... so what is your question, have you actually tried anything? - Has QUIT--Anony-Mousse
my question is how can i use the sparse matrix X to classify?... what's not clear?... - tumbleweed
Sparse matrix = collection of sparse vectors. libSVM preferred input format: collection of sparse vectors. Just think outside of the "everything is a matrix" box. - Has QUIT--Anony-Mousse
Decompose your matrix into vectors. Use these vectors as features for classification. Vectors could simply be columns from the matrix - alvas
You just need labels y and then you can use SVC().fit(X, y). Not sure where the issue is. - Andreas Mueller

1 Answers

1
votes

As others have pointed out, your matrix is just a list of feature vectors for the documents in your corpus. Use these vectors as features for classification. You just need classification labels y and then you can use SVC().fit(X, y).

But... the way that you have asked this makes me think that maybe you don't have any classification labels. In this case, I think you want to be doing clustering rather than classification. You could use one of the clustering algorithms to do this. I suggest sklearn.cluster.MiniBatchKMeans to start. You can then output the top 5-10 words for each cluster and form labels from those.