7
votes

Let's say I have harvested the posts from a forum. Then I removed all the usernames and signatures, so that now I only know what post was in which thread but not who posted what, or even how many authors there are (though clearly the number of authors cannot be greater than the number of texts).

I want to use a Markov model (look at which words/letters follow which ones) to figure out how many people used this forum, and which posts were written by the same person. To vastly simplify, perhaps one person tends to say "he were" while another person tends to say "he was". I'm talking about a model that works with this sort of basic logic.

Note how there are some obvious issues with the data: Some posts may be very short (one word answers). They may be repetitive (quoting each other or using popular forum catchphrases). The individual texts are not very long.

One could suspect that a person rarely makes consecutive posts, or that people are more likely to post in threads they have already posted in. Exploiting this is optional.

Let's assume the posts are plaintexts and have no markup, and that everyone on the forum uses English.

I would like to obtain a distance matrix for all texts T_i such that D_ij is the probability that text T_i and text T_j were written by the same author, based on word/character patterns. I am planning to use this distance matrix to cluster the texts, and to ask questions such as "What other texts were authored by the person who authored this text?"

How would I actually go about implementing this? Do I need a hidden MM? If so, what is the hidden state? I understand how to train an MM on a text and then generate a similar text (e.g. a generated Alice in Wonderland), but after I train a frequency tree, how do I check a text against it to get the probability that the text was generated by that tree? Should I look at letters or words when building the tree?

3
Based on Robert's comment, I think perhaps the distance matrix may be a separate concern. If, like he says, the way to do this is to start with a probability of author-text association and deal with the distance matrix separately, perhaps it would be better for me to revise my question and not go into the matrix (for now, I am not sure). - Superbest

3 Answers

2
votes

My advice is to put aside the business about the distance matrix and think first about a probabilistic model P(text | author). Constructing that model is the hard part of your work; once you have it, you can compute P(author | text) via Bayes' rule. Don't put the cart before the horse: the model might or might not involve distance metrics or matrices of various kinds, but don't worry about that, just let it fall out of the model.
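To make the P(text | author) idea concrete, here is a minimal sketch, not a definitive implementation. It assumes you already have a few texts grouped by (candidate) author, which the question does not actually give you; in the unlabeled setting the groups would have to come from a clustering or EM-style loop. The character order of 2, the add-one smoothing, the alphabet size of 128, and all function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

ORDER = 2  # assumption: condition each character on the previous 2 characters

def train(texts, order=ORDER):
    """Count how often each character follows each `order`-character context."""
    counts = defaultdict(Counter)
    for text in texts:
        padded = " " * order + text.lower()
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts

def log_likelihood(text, counts, order=ORDER, alphabet_size=128):
    """log P(text | author) under one author's counts, with add-one smoothing.

    `alphabet_size` is an assumed bound on the number of distinct characters.
    """
    padded = " " * order + text.lower()
    total = 0.0
    for i in range(order, len(padded)):
        ctx = counts.get(padded[i - order:i], Counter())
        total += math.log((ctx[padded[i]] + 1) / (sum(ctx.values()) + alphabet_size))
    return total

def posterior(text, models, priors=None):
    """P(author | text) via Bayes' rule; uniform prior unless one is given."""
    names = list(models)
    priors = priors or {n: 1.0 / len(names) for n in names}
    scores = {n: log_likelihood(text, models[n]) + math.log(priors[n]) for n in names}
    top = max(scores.values())  # subtract the max before exp() to avoid underflow
    weights = {n: math.exp(s - top) for n, s in scores.items()}
    z = sum(weights.values())
    return {n: w / z for n, w in weights.items()}
```

For example, with `models = {"a": train(["he were going"]), "b": train(["he was going"])}`, calling `posterior("he were late", models)` returns a dict of probabilities over the two candidate authors. Working in log space matters here: multiplying many small per-character probabilities directly would underflow even on short posts.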

1
votes

You might want to take a look at Hierarchical Clustering. With this algorithm you can define your own distance function and it will give you clusters based on it. If you define a good distance function, the resulting clusters will correspond to one author each.

This is probably quite hard to do though and you might need a lot of posts to really get an interesting result. Nevertheless, I wish you good luck!
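As a rough illustration of plugging your own distance function into hierarchical clustering, the sketch below implements plain agglomerative clustering with average linkage in pure Python. The Jaccard word-overlap distance and the stopping threshold are placeholder assumptions chosen only to keep the example small; a real author distance would need to be far more discriminating.

```python
from itertools import combinations

def agglomerative(items, distance, threshold):
    """Merge the two closest clusters until none are closer than `threshold`.

    Uses average linkage: cluster distance is the mean pairwise item distance.
    Returns a list of clusters, each a list of indices into `items`.
    """
    clusters = [[i] for i in range(len(items))]

    def linkage(a, b):
        return sum(distance(items[i], items[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), best = min(
            (((p, q), linkage(clusters[p], clusters[q]))
             for p, q in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if best > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters

def jaccard_distance(a, b):
    """Toy distance (an assumption): 1 minus the overlap of the posts' word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)
```

Calling `agglomerative(posts, jaccard_distance, 0.7)` on a list of post strings returns groups of post indices. This naive version recomputes every pairwise distance on each merge, so it is only suitable for small experiments.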

1
votes

You mention a Markov model in your question. Markov models are about sequences of tokens and how one token depends on previous tokens and possibly internal state.

If you want to use probabilistic methods, you might want to use a different kind of statistical model that is based not so much on sequences as on bags or sets of words or features.

For example, you could take the K most frequent words of the text and create all M-grams of tokens in each post, with the non-frequent words replaced by empty placeholders. This could allow you to learn phrases commonly used by different authors.

In addition, you could use single words as features, so that a post gets as features all the words it contains (here you can ignore frequent words and use only rare words: the same authors might be interested in the same topics, use the same words, or make the same spelling mistakes).
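A minimal sketch of the masked M-gram idea follows. The whitespace tokenization, the `_` placeholder, and the function names are assumptions made for illustration, as is the choice to drop windows that contain only placeholders.

```python
from collections import Counter

PLACEHOLDER = "_"  # hypothetical marker standing in for any non-frequent word

def top_k_words(posts, k):
    """Return the set of the k most frequent lowercase tokens in the corpus."""
    counts = Counter(w for post in posts for w in post.lower().split())
    return {w for w, _ in counts.most_common(k)}

def masked_mgrams(post, frequent, m):
    """Count M-grams of a post after masking every word not in `frequent`."""
    tokens = [w if w in frequent else PLACEHOLDER for w in post.lower().split()]
    grams = zip(*(tokens[i:] for i in range(m)))
    # Skip windows made only of placeholders; they carry no phrase information.
    return Counter(g for g in grams if any(t != PLACEHOLDER for t in g))
```

With `frequent = top_k_words(posts, k)`, each post can be turned into a feature vector via `masked_mgrams(post, frequent, m)`. Because content words are masked, patterns such as `("he", "were")` survive regardless of topic, which is the point of the technique.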

Additionally, you can try to capture the style of authors in features: how many paragraphs, how long the sentences are, how many commas per sentence, whether the author uses capitalization, whether numbers are spelled out, etc. These are all features that are not sequences, as you would use in an HMM, but features assigned to each post.

In summary: even though sequences are certainly important for catching phrases, you definitely want more than just a sequence model.
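The style features listed above could be extracted along these lines. This is only a sketch: the regular expressions are crude assumptions about what counts as a paragraph, a sentence, or a word, and the feature names are made up for the example.

```python
import re

def style_features(post):
    """Map one post to a dict of simple, order-free style measurements."""
    paragraphs = [p for p in re.split(r"\n\s*\n", post) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", post) if s.strip()]
    words = re.findall(r"[A-Za-z']+", post)
    n_sent = max(len(sentences), 1)  # guard against division by zero
    return {
        "paragraphs": len(paragraphs),
        "words_per_sentence": len(words) / n_sent,
        "commas_per_sentence": post.count(",") / n_sent,
        # Share of sentences whose first character is an upper-case letter.
        "capitalized_sentences": sum(s.strip()[0].isupper() for s in sentences) / n_sent,
        # Numbers written as digits ("3") rather than spelled out ("three").
        "digit_tokens": len(re.findall(r"\b\d+\b", post)),
    }
```

Each post then becomes a small numeric vector that can be combined with word or M-gram features. On very short posts these ratios are noisy, so they are probably more useful once several posts have been tentatively grouped together.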