3
votes

I am stuck deciding whether to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with.

I have news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, the economy, quarterly income, etc. My goal is to sort the news so that I know which news items correspond to which company. E.g., for the news item "Apple launches new iphone", I need to associate the company Apple with it. A news item/document contains only a 'title' and a 'description', so I have to analyze the text to find out which company the news refers to. It could be multiple companies too.

To solve this, I turned to Mahout.

I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel', etc. as the top terms in my clusters, so that I would know the news in a cluster corresponds to its cluster label, but things turned out a bit differently. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense, as clustering algorithms look for common terms). There were some 'Apple' clusters, but very few news items were associated with them. I thought maybe clustering is not suited to this kind of problem, as much of the company news goes into the more general clusters (investment, profit) instead of the specific company cluster (Apple).

I started reading about classification, which requires training data. The name was convincing too, as I actually want to 'classify' my news items into 'company names'. As I read on, I got the impression that the name classification is a bit deceiving and that the technique is used more for prediction than for classification. My other confusion was: how can I prepare training data for news documents? Let's assume I have a list of companies that I am interested in. I write a program to produce training data for the classifier. The program checks whether the news title or description contains the company name 'Apple'; if so, it is a news story about Apple. Is this how I can prepare training data? (Of course, I have read that training data is actually a set of predictors and target variables.) If so, then why should I use Mahout classification in the first place? I should ditch Mahout and instead use this little program that I wrote for the training data (which actually does the classification).

You can see how confused I am about how to address this issue. Another thing that concerns me is whether it is possible to make a system so intelligent that, if the news says 'iphone sales at a record high' without using the word 'Apple', the system can still classify it as news related to Apple.

Thank you in advance for pointing me in the right direction.

2 Answers

3
votes

Copying my reply from the mailing list:

Classifiers are supervised learning algorithms, so you need to provide a bunch of examples of positive and negative classes. In your example, it would be fine to label a bunch of articles as "about Apple" or not, then use feature vectors derived from TF-IDF as input, with these labels, to train a classifier that can tell when an article is "about Apple".
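
As a rough illustration of that workflow (hand-labeled articles, TF-IDF feature vectors, a trained classifier), here is a minimal sketch in plain Python. The nearest-centroid classifier is only a stand-in chosen for brevity, not anything Mahout-specific, and the training headlines, labels and function names are all made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training docs."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(text, idf):
    """L2-normalised TF-IDF vector (as a dict); unseen terms are ignored."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train(labeled):
    """Return (idf, centroids), where centroids maps label -> mean TF-IDF vector."""
    idf = fit_idf([text for text, _ in labeled])
    grouped = {}
    for text, label in labeled:
        grouped.setdefault(label, []).append(tfidf(text, idf))
    centroids = {}
    for label, vectors in grouped.items():
        total = {}
        for vec in vectors:
            for term, weight in vec.items():
                total[term] = total.get(term, 0.0) + weight
        centroids[label] = {t: w / len(vectors) for t, w in total.items()}
    return idf, centroids


def predict(text, model):
    """Pick the label whose centroid is most similar to the text."""
    idf, centroids = model
    vec = tfidf(text, idf)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Hypothetical hand-labeled examples; a real training set would be far larger.
LABELED = [
    ("Apple launches new iPhone", "about_apple"),
    ("iPhone and iPad sales lift Apple quarterly income", "about_apple"),
    ("Apple unveils thinner MacBook", "about_apple"),
    ("Apple juice shown to reduce risk of dementia", "not_apple"),
    ("Farmers expect strong apple harvest this autumn", "not_apple"),
    ("Fruit juice prices rise after poor harvest", "not_apple"),
]

model = train(LABELED)
print(predict("iPhone sales at a record high", model))
print(predict("Apple juice recalled after harvest contamination", model))
```

On this toy data the first headline lands in the Apple class without containing the word 'Apple', because 'iPhone' and 'sales' only occur in the positive examples. With so few examples this is fragile, which is exactly why a realistic training set needs many labeled articles.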

I don't think it will quite work to automatically generate the training set by labeling according to the simple rule that an article is about Apple if 'Apple' is in the title. If you do that, there is no point in training a classifier: the test labels come from the same rule, so a trivial classifier that just checks whether 'Apple' is in the title achieves 100% accuracy on your test set. Yes, you are right, this gains you nothing.

Clearly you want to learn something subtler from the classifier, so that an article titled "Apple juice shown to reduce risk of dementia" isn't classified as about the company. You'd really need to feed it hand-classified documents.
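
To make the limits of the simple rule concrete, here is a small sketch of such a keyword matcher in Python. The company list and the function name `keyword_label` are hypothetical, used only for illustration.

```python
import re

# Hypothetical list of companies and the keywords that trigger each one.
COMPANIES = {
    "Apple": ["apple"],
    "Google": ["google"],
}


def keyword_label(text, companies=COMPANIES):
    """Return every company whose keyword appears as a word in the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(name for name, keywords in companies.items()
                  if tokens & set(keywords))


print(keyword_label("Apple launches new iPhone"))                     # ['Apple']
print(keyword_label("Apple and Google settle patent dispute"))        # ['Apple', 'Google']
print(keyword_label("Apple juice shown to reduce risk of dementia"))  # ['Apple'], a false positive
print(keyword_label("iPhone sales at a record high"))                 # [], a miss
```

The last two calls show the two failure modes: the rule fires on the fruit, and it stays silent on a story that never names the company. A classifier trained on labels produced by this rule can only learn to reproduce the same mistakes.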

That's the bad news, but you can certainly train N classifiers for N topics this way.
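
A minimal sketch of the "N classifiers for N topics" idea, assuming one yes/no classifier per company so that an article can be tagged with several companies or with none. The multinomial naive Bayes used here is just a convenient small classifier for the example, and the class name, helper names and training headlines are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


class BinaryNaiveBayes:
    """Multinomial naive Bayes for one yes/no question, with add-one smoothing.

    Assumes the training data contains at least one positive and one
    negative example.
    """

    def fit(self, texts, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def _log_score(self, tokens, label):
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            if token in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][token] + 1)
                                  / (total + len(self.vocab)))
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


def train_one_per_company(labeled, companies):
    """Train one independent binary classifier per company."""
    texts = [text for text, _ in labeled]
    return {
        company: BinaryNaiveBayes().fit(texts, [company in tags for _, tags in labeled])
        for company in companies
    }


def companies_for(text, models):
    """All companies whose classifier says yes; may be empty or contain several."""
    return sorted(c for c, m in models.items() if m.predict(text))


# Hypothetical hand-labeled articles: each has a set of company tags.
LABELED = [
    ("Apple launches new iPhone", {"Apple"}),
    ("Apple unveils thinner MacBook", {"Apple"}),
    ("Google updates Android search app", {"Google"}),
    ("Google search advertising revenue grows", {"Google"}),
    ("Apple and Google settle smartphone patent dispute", {"Apple", "Google"}),
    ("Apple juice shown to reduce risk of dementia", set()),
    ("Oil stocks fall as economy slows", set()),
]

models = train_one_per_company(LABELED, ["Apple", "Google"])
print(companies_for("iPhone sales at a record high", models))        # ['Apple']
print(companies_for("Google and Apple end patent dispute", models))  # ['Apple', 'Google']
print(companies_for("Apple juice shown to reduce risk", models))     # []
```

Keeping the classifiers independent is what allows the "multiple companies" case from the question: each one answers only "is this article about my company?", and the answers are simply collected.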

Classifiers put items into a class or not. They are not the same as regression techniques which predict a continuous value for an input. They're related but distinct.

Clustering has the advantage of being unsupervised. You don't need labels. However the resulting clusters are not guaranteed to match up to your notion of article topics. You may see a cluster that has a lot of Apple articles, some about the iPod, but also some about Samsung and laptops in general. I don't think this is the best tool for your problem.
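
A small sketch of why this happens, assuming the usual bag-of-words representation with cosine similarity (the three headlines are made up). Two finance stories about different companies share far more words than two stories about the same company, so a similarity-based clustering tends to group them by subject matter rather than by company.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical headlines used only for this comparison.
apple_product = bag_of_words("Apple launches new iPhone")
apple_finance = bag_of_words("Apple shares rise as investors cheer quarterly profit")
intel_finance = bag_of_words("Intel shares rise as investors cheer quarterly profit outlook")

same_company = cosine(apple_product, apple_finance)     # share only the word 'apple'
different_company = cosine(apple_finance, intel_finance)  # share seven finance words

print(f"Apple product vs Apple finance: {same_company:.2f}")
print(f"Apple finance vs Intel finance: {different_company:.2f}")
```

The second similarity comes out several times larger than the first, which matches the 'investment' and 'shares' clusters described in the question.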

1
votes

First of all, you don't need Mahout. 3000 documents is close to nothing. Revisit Mahout when you hit a million. I've been processing 100,000 images on a single computer, so you really can skip the overhead of Mahout for now.

What you are trying to do sounds like classification to me, because you have predefined classes.

A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "MacBook". Or, on the other hand, it may merge Apple and Google, as they are closely related (much more so than, say, Apple and Ford).

Yes, you need training data that reflects the structure you want to measure. There is other structure in the data too (e.g. iPhones not being the same as MacBooks, or Google, Facebook and Apple being more similar to each other than Kellogg's, Ford and Apple are). If you want structure at the company level, you need training data at that level of detail.