0 votes

My environment: a notebook with 8 GB RAM running Ubuntu 14.04, Solr 4.3.1, Carrot2 Workbench 3.10.0

My Solr Index: 15980 documents

My problem: cluster all documents with the k-means algorithm

When I run the query in the Carrot2 Workbench (query: *:*), I always get a Java heap space error as soon as I request more than roughly 1000 results. I started Solr with -Xms256m -Xmx6g, but the error still occurs.

Is it really a heap size problem or could it be somewhere else?


2 Answers

0 votes

Your suspicion is correct: it is a heap size problem, or more precisely a scalability constraint. Straight from the Carrot2 FAQ: http://project.carrot2.org/faq.html#scalability

How does Carrot2 clustering scale with respect to the number and length of documents? The most important characteristic of Carrot2 algorithms to keep in mind is that they perform in-memory clustering. For this reason, as a rule of thumb, Carrot2 should successfully deal with up to a thousand of documents, a few paragraphs each. For algorithms designed to process millions of documents, you may want to check out the Mahout project.

A developer also posted about this here: https://stackoverflow.com/a/28991477

The developers recommend Mahout, and that is probably the way to go, since you would not be bound by Carrot2's in-memory clustering constraint. There are, however, a couple of other options:

  1. If you really like Carrot2 but do not necessarily need k-means, you could take a look at the commercial Lingo3G. Based on the "Time of clustering 100000 snippets [s]" field and the (***) remark on http://carrotsearch.com/lingo3g-comparison, it should be able to handle more documents. Also check the FAQ entry "What is the maximum number of documents Lingo3G can cluster?" on http://carrotsearch.com/lingo3g-faq

  2. Try to minimize the size of the text on which k-means performs the clustering. Instead of clustering over the full document content, cluster on an abstract/summary, or extract the most important keywords from each document and cluster on those (see the sketch after this list).
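
As a rough illustration of option 2, here is a minimal sketch, assuming the documents have already been exported from Solr into a Python list of strings. The scikit-learn setup, the placeholder texts, and the chosen parameters (TOP_N, number of clusters) are my assumptions, not part of the original question:

```python
# Sketch: cluster on a few top TF-IDF keywords per document instead of the
# full text, to keep the in-memory footprint small.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np

# Placeholder: in practice these would be the document texts exported from Solr.
docs = [
    "solr index clustering with the carrot2 workbench runs out of heap memory",
    "k-means clustering of short keyword summaries uses far less memory",
    "mahout is designed for clustering millions of documents",
]

# Build TF-IDF vectors over the full text once.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
tfidf = vectorizer.fit_transform(docs)
terms = np.array(vectorizer.get_feature_names_out())

# Keep only the top-N keywords per document and join them into a short summary.
TOP_N = 10
reduced_docs = []
for row in tfidf:
    weights = row.toarray().ravel()
    top = weights.argsort()[::-1][:TOP_N]
    reduced_docs.append(" ".join(terms[top[weights[top] > 0]]))

# Cluster the much smaller keyword documents with k-means.
reduced_tfidf = TfidfVectorizer().fit_transform(reduced_docs)
n_clusters = min(20, len(docs))   # pick a sensible k for the real index
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(reduced_tfidf)
print(labels)
```

The point is that the text handed to the clusterer is now a handful of keywords per document instead of the full body, which keeps memory use far below the full-text case.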

0 votes

That sounds as if Carrot2 uses much too much memory.

K-means itself doesn't need a whole lot of memory: essentially one integer (the cluster assignment) per document, plus the k centroid vectors.

So you should be able to run k-means on millions of documents in memory, even with the document vectors themselves held in memory.

16k documents is not a lot, so I don't see why a good implementation should run into trouble here. It rather looks like they want you to buy the commercial version to make a living! Going to Mahout seems like overkill to me: your data most likely still fits into main memory, so don't waste time distributing it over the network, which is orders of magnitude slower than your RAM.
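
A back-of-the-envelope estimate supports that. The average number of non-zero terms per document below is an assumed value, not something measured on your index:

```python
# Rough memory estimate for holding ~16k sparse TF-IDF document vectors in RAM.
# avg_nonzero_terms is an assumption; adjust it to your actual documents.
n_docs = 15980
avg_nonzero_terms = 500        # assumed distinct terms per document
bytes_per_entry = 8 + 4        # 8-byte float value + 4-byte column index (CSR-style)

vectors_mb = n_docs * avg_nonzero_terms * bytes_per_entry / 1e6
assignments_mb = n_docs * 4 / 1e6   # one 32-bit cluster id per document

print(f"sparse document vectors: ~{vectors_mb:.0f} MB")    # ~96 MB
print(f"cluster assignments:     ~{assignments_mb:.2f} MB")
```

Even with generous assumptions this stays well below 1 GB, a small fraction of an 8 GB machine.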

Maybe implement k-means yourself. It's not difficult...
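
Indeed, a plain Lloyd's k-means fits in a couple of dozen lines. The sketch below assumes the documents have already been turned into fixed-length feature vectors (e.g. TF-IDF); it is an illustration, not the Carrot2 or Mahout implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means on a dense (n_docs, n_features) matrix X."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k randomly chosen documents.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)          # one integer per document

    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        # The (n, k, features) broadcast is fine for small corpora; for millions
        # of documents you would compute the distances in chunks instead.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                          # converged: assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned documents.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy usage: six 2-D points forming two obvious clusters.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(X, k=2)
print(labels)          # e.g. [0 0 0 1 1 1] (cluster ids may be swapped)
```

For a real 16k-document TF-IDF matrix you would keep X sparse and expand the squared distance as ||x||^2 - 2 x.c + ||c||^2, so the assignment step becomes a sparse matrix product.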