First, the problem of “Porn” documents being classified as “Financial” seems to be a separate issue from your main question, so I’ll address only the main question here.
You have data for 9 categories, but the actual document universe is larger. The problem is to recognize when a new document is unlike anything you have seen before. That is closer to outlier (or anomaly) detection than to classification.
You'll have to do some background reading to proceed further, but here are some points to get you started. One strategy is to determine whether the new document is “similar” to the documents already in your collection, the idea being that an outlier is unlikely to be similar to “normal” documents. To do this, you need a robust measure of document similarity.
Outline of a potential method you can use:
- Find a good representation of the documents, say tf-idf vectors, or better.
- Benchmark the documents within your collection. For each document, its “goodness” score is its highest similarity to any other document in the collection. (Alternatively, you can use the k-th highest similarity, for some fault tolerance.)
- Given the new document, measure its goodness score in a similar way.
- Compare the new document’s goodness score with the scores of the documents in your collection. A score well below what those documents typically get is a sign of an outlier.
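The steps above can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions of mine: a toy six-document corpus, a hand-rolled smoothed tf-idf, cosine similarity as the similarity measure, and a cutoff equal to the lowest goodness score seen inside the collection. All function names (`fit_idf`, `goodness`, `is_outlier`, etc.) are made up for this example; in practice you would likely use a library for the vectorization and tune the cutoff on held-out data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumerics (a deliberately crude tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def fit_idf(docs_tokens):
    # Smoothed inverse document frequency, learned from the collection only.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf_vector(tokens, idf):
    # Terms never seen in the collection are dropped; result is unit length.
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def goodness(vec, others, k=1):
    # k-th highest similarity to the other documents (k=1 is the maximum).
    sims = sorted((cosine(vec, o) for o in others), reverse=True)
    return sims[min(k, len(sims)) - 1] if sims else 0.0

def collection_scores(vectors, k=1):
    # Benchmark: each document is scored against all the others.
    return [goodness(v, vectors[:i] + vectors[i + 1:], k)
            for i, v in enumerate(vectors)]

def is_outlier(text, idf, vectors, baseline, k=1):
    # Assumed cutoff: flag anything scoring below every in-collection document.
    score = goodness(tfidf_vector(tokenize(text), idf), vectors, k)
    return score < min(baseline), score

docs = [
    "stock market shares rise as bank profits grow",
    "bank shares fall after market interest rate decision",
    "interest rate cut lifts stock market and bank profits",
    "football team wins league match with late goal",
    "league match ends in draw as team misses late goal",
    "team scores late goal to win football league",
]
tokens = [tokenize(d) for d in docs]
idf = fit_idf(tokens)
vectors = [tfidf_vector(t, idf) for t in tokens]
baseline = collection_scores(vectors)

for text in ["bank profits grow as stock market shares rise",
             "chocolate cake recipe needs butter sugar eggs"]:
    print(text, "->", is_outlier(text, idf, vectors, baseline))
```

Using the minimum in-collection score as the cutoff is the crudest possible choice; a low percentile of the benchmark scores, or a larger `k`, would be less sensitive to a single odd document in the collection.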
Further reading:
- Survey of Anomaly Detection
- LSA, which is a technique for text representation and similarity computation.