Let's assume the following scenario.
Lucene document: ArticleDocument
Fields: {Id, text, publisherId}
A publisher can publish multiple articles.
Problem
I would like to build a word cloud (most frequent words and shingles) for each publisherId.
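To make the goal concrete, here is a minimal plain-Java sketch of the result I am after, with no Lucene involved. It counts words per publisherId over in-memory articles and keeps the top N for each publisher. The names `WordCloudSketch`, `Article`, and `topTermsByPublisher` are hypothetical, the tokenization is a naive split rather than a Lucene analyzer, and shingles are left out for brevity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class WordCloudSketch {

    /** Hypothetical in-memory stand-in for an ArticleDocument {Id, text, publisherId}. */
    static final class Article {
        final String id;
        final String text;
        final String publisherId;

        Article(String id, String text, String publisherId) {
            this.id = id;
            this.text = text;
            this.publisherId = publisherId;
        }
    }

    /** For each publisherId, returns its topN most frequent lowercased words. */
    static Map<String, List<String>> topTermsByPublisher(List<Article> articles, int topN) {
        // publisherId -> (term -> count across all of that publisher's articles)
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (Article a : articles) {
            Map<String, Integer> perPublisher =
                    counts.computeIfAbsent(a.publisherId, k -> new HashMap<>());
            // Naive tokenization (assumption): split on anything that is not a letter or digit.
            for (String token : a.text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    perPublisher.merge(token, 1, Integer::sum);
                }
            }
        }

        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            List<Map.Entry<String, Integer>> entries = new ArrayList<>(e.getValue().entrySet());
            // Highest count first; ties broken alphabetically so the order is deterministic.
            entries.sort((x, y) -> {
                int byCount = Integer.compare(y.getValue(), x.getValue());
                return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
            });
            List<String> top = new ArrayList<>();
            for (int i = 0; i < entries.size() && i < topN; i++) {
                top.add(entries.get(i).getKey());
            }
            result.put(e.getKey(), top);
        }
        return result;
    }
}
```

What I am looking for is the Lucene equivalent of this grouping, computed from the index rather than by re-tokenizing every article in memory.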
So far I have only found ways to get the most frequent terms for the entire index or for a single document, but not for a subset of documents. I found a similar question, but it targets Lucene 2.x, and I'm hoping there is a more effective way in recent Lucene versions.
Could you point me to a way to do this in Lucene 4.x (preferred) or in the latest 3.x release?
Please note that I cannot make each publisher a single document with all of its articles appended to one field. That's because I want the words in the cloud to be searchable, with the corresponding articles (by the same publisherId) returned as the results.
I'm also not sure whether maintaining two types of Lucene documents (article and publisher) is a good idea in terms of maintenance and performance.