I'm attempting to use MALLET's `TopicInferencer` to infer keywords from arbitrary text using a trained model. So far my overall approach is as follows:
- Train a `ParallelTopicModel` with a large set of known training data to create a collection of topics. I'm currently using a training file with 250,000 lines to create 5,000 topics.
- Create an `InstanceList` from arbitrary text not in the trained model.
- Use the trained model's `topicInferencer.getSampledDistribution` to generate a topic distribution for the unknown text against the model.
- Sort the returned distribution and extract the IDs of the top `n` topics that most closely match the unknown input text.
- Extract the top keywords from each of the matching topics.
My code is as follows:
Generating the `ParallelTopicModel`:
```
InstanceList instanceList = new InstanceList(makeSerialPipeList());
// training file with one entry per line (around 250,000 lines)
instanceList.addThruPipe(new SimpleFileLineIterator(trainingFile));

// should train a model with the end result being 5,000 topics, each with a collection of words
ParallelTopicModel parallelTopicModel = new ParallelTopicModel(
        5000,  // number of topics; I think with a large sample size we want a large collection of topics
        1.0D,  // todo: alphaSum, really not sure what this does
        0.01D  // todo: beta, really not sure what this does
);
parallelTopicModel.setOptimizeInterval(20); // todo: read about this
parallelTopicModel.addInstances(instanceList);
parallelTopicModel.setNumIterations(2000);
parallelTopicModel.estimate();
```
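For context, `makeSerialPipeList()` builds a fairly standard MALLET pipe chain, roughly along these lines (the exact pipes and tokenization pattern below are a sketch; the important part is that the chain ends in `TokenSequence2FeatureSequence`, which builds the `Alphabet`):

```
import cc.mallet.pipe.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

private static SerialPipes makeSerialPipeList() {
    ArrayList<Pipe> pipes = new ArrayList<>();
    pipes.add(new Input2CharSequence("UTF-8"));                            // raw input -> CharSequence
    pipes.add(new CharSequence2TokenSequence(Pattern.compile("\\p{L}+"))); // tokenize on runs of letters
    pipes.add(new TokenSequenceLowercase());                               // normalize case
    pipes.add(new TokenSequenceRemoveStopwords(false, false));             // drop default English stopwords
    pipes.add(new TokenSequence2FeatureSequence());                        // tokens -> feature IDs (builds the Alphabet)
    return new SerialPipes(pipes);
}
```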
My first group of questions relates to creating the `ParallelTopicModel`.
Since I'm using a fairly large training file, I assume I want a large number of topics. My logic here is that the larger the topic count, the more closely inferred keywords will match arbitrary input text.
I'm also unsure how the alphaSum and beta values and the number of iterations will affect the generated model.
On the inference side, I'm using the `ParallelTopicModel` to create an inferred topic distribution:
```
TopicInferencer topicInferencer = parallelTopicModel.getInferencer();

String document = "..."; // arbitrary text not in the trained model

// following the format I found in SimpleFileLineIterator to create an Instance out of a document
Instance instance = new Instance(document, null, new URI("array:" + 1), null);

// same SerialPipes used to create the InstanceList for the ParallelTopicModel
InstanceList instanceList = new InstanceList(serialPipes);
instanceList.addThruPipe(instance);

// this should return an array indexed by topic ID, holding the match value:
// [topicId] = 0.5 // match value
double[] topicDistribution = topicInferencer.getSampledDistribution(
        instanceList.get(0), // the piped instance
        2000, // numIterations, same iteration count used in the model
        1,    // todo: thinning, not sure what this does
        5     // todo: burnIn, not sure what this does
);

// returns a sorted list of the top 5 topic IDs;
// these should be the indexes of the largest values in the returned topicDistribution
List<Integer> topIndexes = topIndexes(topicDistribution, 5);

// topics with their keywords, sorted by weight within each topic
ArrayList<TreeSet<IDSorter>> sortedWords = parallelTopicModel.getSortedWords();

// loop over the top topic indexes
topIndexes.forEach(index -> {
    IDSorter idSorter = sortedWords.get(index).first(); // should hopefully be the top keyword in each topic
    // not sure what Alphabet I should use here or if it really matters?
    // I passed in the alphabet from the original instance list as well as the one contained in our model
    Object result = parallelTopicModel.getAlphabet().lookupObject(idSorter.getID());
    double weight = idSorter.getWeight();
    String formattedResult = String.format("%s:%.0f", result, weight);
    // I should now have a relevant keyword and a weight in my result
});
```
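For completeness, `topIndexes` is just a sort-by-value helper, roughly like this (a minimal sketch using plain Java streams):

```
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// returns the indexes of the n largest values in the distribution, in descending order
private static List<Integer> topIndexes(double[] distribution, int n) {
    return IntStream.range(0, distribution.length)
            .boxed()
            .sorted(Comparator.comparingDouble((Integer i) -> distribution[i]).reversed())
            .limit(n)
            .collect(Collectors.toList());
}
```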
I have a similar set of questions here. First, I'm not entirely sure if this overall approach is even correct.
I'm also not sure which `Alphabet` I should be using: the one from the `InstanceList` used to generate the `ParallelTopicModel`, or the one obtained directly from the `ParallelTopicModel`.
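Concretely, the two candidates I mean are these (a sketch; `instanceList` here is the training list from the first snippet):

```
// both of these appear to give me the same-looking results:
Object fromModel = parallelTopicModel.getAlphabet().lookupObject(idSorter.getID());
Object fromInstances = instanceList.getDataAlphabet().lookupObject(idSorter.getID());
```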
I know this is a fairly involved question, but any insight would be greatly appreciated!