I am trying to classify pieces of text into categories. I have 9 categories, but the sentences I am given can belong to more categories than those. My objective is to take a piece of text and find the industry of each sentence. One common problem is that my training set does not have a "Porn" category, so sentences with porn material get classified as "Financial".

I want my classifier to check whether a sentence can be assigned to one of the classes and, if not, just print that it can't classify that text.

I am using a tf-idf vectorizer to transform the sentences, and then I feed the data to a LinearSVC.

Can anyone help me with this issue, or point me to any useful material?

1 Answer

Firstly, the problem you have with the “Porn” documents being classified as “Financial” doesn’t seem to be entirely related to the other question here. I’ll address the main question for now.

The setting is that you have data for 9 categories, but the actual document universe is bigger. The problem is to determine that you haven’t seen the likes of a particular data point before. This looks more like outlier or anomaly detection than classification.

You'll have to do some background reading to proceed further, but here are some points to get you started. One strategy is to determine whether the new document is “similar” to other documents in your collection. The idea is that an outlier is unlikely to be similar to “normal” documents. To do this, you need a robust measure of document similarity.

Outline of a potential method you can use:

  • Find a good representation of the documents, say tf-idf vectors, or better.
  • Benchmark the documents within your collection. For each document, the “goodness” score is its highest similarity score against any other document in the collection. (Alternatively, you can use the k-th highest similarity, for some fault tolerance.)
  • Given the new document, measure its goodness score in a similar way.
  • How does the new document compare to other documents in terms of the goodness score? A very low goodness score is a sign of an outlier.
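
The outline above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not a definitive implementation: the whitespace tokenizer, the smoothed idf formula, the toy corpus, and the choice of the lowest in-collection goodness score as the cut-off are all assumptions made for the example. In practice you would probably reuse your existing tf-idf vectorizer and pick the threshold on held-out data.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency for every term in the collection.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf_vector(text, idf):
    # L2-normalized sparse tf-idf vector; terms unseen in the collection are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are already normalized, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def goodness(vec, others, k=1):
    # k-th highest similarity to the reference vectors (k=1 is the maximum).
    sims = sorted((cosine(vec, o) for o in others), reverse=True)
    return sims[min(k, len(sims)) - 1] if sims else 0.0


def benchmark(vectors, k=1):
    # Goodness of each collection document against all the *other* documents.
    return [goodness(v, vectors[:i] + vectors[i + 1:], k)
            for i, v in enumerate(vectors)]


def is_outlier(text, idf, vectors, threshold, k=1):
    return goodness(tfidf_vector(text, idf), vectors, k) < threshold


# Toy collection (hypothetical data, for illustration only).
corpus = [
    "bank loan interest rate",
    "interest rate on bank deposit",
    "stock market bank shares",
    "football match final score",
    "football team wins match",
    "tennis match final score",
]
idf = fit_idf(corpus)
vectors = [tfidf_vector(d, idf) for d in corpus]
scores = benchmark(vectors)
threshold = min(scores)  # Assumption: crude cut-off; tune on real data.

for text in ["bank interest rate rises", "quantum chromodynamics lattice"]:
    if is_outlier(text, idf, vectors, threshold):
        print("can't classify:", text)
    else:
        print("looks familiar:", text)
```

Only documents that pass this check would then be handed to the classifier. Setting `k` above 1 makes the score less sensitive to a single accidental near-match.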

Further reading:

  • Survey of Anomaly Detection
  • LSA, which is a technique for text representation and similarity computation.