3
votes

I'm trying to classify an example, which contains discrete and continuous features. Also, the example represents sparse data, so even though the system may have been trained on 100 features, the example may only have 12.

What would be the best classifier algorithm to use to accomplish this? I've been looking at Bayes, Maxent, Decision Tree, and KNN, but I'm not sure any fit the bill exactly. The biggest sticking point I've found is that most implementations don't support sparse data sets and both discrete and continuous features. Can anyone recommend an algorithm and implementation (preferably in Python) that fits these criteria?

Libraries I've looked at so far include:

  1. Orange (Mostly academic. Implementations not terribly efficient or practical.)
  2. NLTK (Also academic, although has a good Maxent implementation, but doesn't handle continuous features.)
  3. Weka (Still researching this. Seems to support a broad range of algorithms, but has poor documentation, so it's unclear what each implementation supports.)
3

3 Answers

2
votes

Support vector machines? libsvm can be used from Python, and is quite speedy.

Handles sparse vector inputs, and won't mind if some of the features are continuous, where others are just -1/+1. (If you've got an n-way discrete feature, the standard thing to do is expand it into n binary features.)

2
votes

Weka (Java) satisfies all you requirements:

Check out this Pentaho wiki for a list of links to documentations, guides, video tutorials, etc ...

2
votes

scikit-learn, a Python machine learning module supports Stochastic Gradient Descent and Support Vector machines for sparse data.