0 votes

I want to build a word cloud containing multi-word structures (not just single words). In any given text, unigrams will have higher frequencies than bigrams; in general, n-gram frequency decreases as n increases for the same text.

I want to find a magic number or a method to obtain comparable results across unigrams, bigrams, trigrams, and n-grams in general.

Is there a magic number that can serve as a multiplier for n-gram frequencies so that they become comparable with unigram frequencies?

A solution I have in mind is to rank each n-gram category (1, 2, 3, ...) separately and take the top z positions from each category.
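That per-category ranking can be sketched as follows. This is a minimal illustration; the `top_z_per_ngram` helper and the sample tokens are hypothetical, not an existing API:

```python
from collections import Counter

def top_z_per_ngram(tokens, max_n=3, z=10):
    """Rank each n-gram size separately and keep the top z of each size."""
    result = {}
    for n in range(1, max_n + 1):
        # Build all n-grams of size n as space-joined strings.
        ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        result[n] = Counter(ngrams).most_common(z)
    return result

tokens = "the cat sat on the mat the cat sat".split()
top = top_z_per_ngram(tokens, max_n=2, z=3)
# e.g. top[1] starts with ('the', 3); top[2] contains ('the cat', 2)
```

Each category then contributes z entries to the cloud, sidestepping the cross-category frequency comparison entirely.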

1

Could you give a little more context on how you want to compare the two? I can think of some measures that would make for some comparison (information gain, specificity), but most of them need a context. – S van Balen

1 Answer

1 vote

As you've phrased the question, there is no simple linear multiplier. You can, however, make a rough estimate based on the size of your set of units. Consider the English alphabet of 26 letters: there are 26 possible unigrams, 26^2 bigrams, 26^3 trigrams, and so on. A naive treatment suggests multiplying a bigram's frequency by 26 to compare it with unigrams; trigram frequencies would get a 26^2 boost.
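A sketch of that alphabet-size scaling for letter n-grams (the function name and the choice of scale factor are illustrative, under the naive assumption above):

```python
from collections import Counter

def scaled_letter_ngram_freqs(text, n, alphabet_size=26):
    """Count letter n-grams and scale counts by alphabet_size^(n-1),
    so they sit on roughly the same scale as unigram counts."""
    letters = [c for c in text.lower() if c.isalpha()]
    grams = ["".join(letters[i:i + n]) for i in range(len(letters) - n + 1)]
    counts = Counter(grams)
    scale = alphabet_size ** (n - 1)
    return {g: c * scale for g, c in counts.items()}

# "abab" contains the bigram "ab" twice; scaled by 26 it becomes 52.
print(scaled_letter_ngram_freqs("abab", 2))
```

For word n-grams you would replace `alphabet_size` with the vocabulary size, which makes the scale factor much larger and the estimate correspondingly cruder.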

I don't know whether that achieves the comparison you want, since the actual distribution of n-grams does not follow any simple, mathematically tractable function. For instance, letter-trigram distributions are a good way to identify the language in use: English, French, Spanish, German, Romanian, etc. have readily distinguishable distributions.

Another possibility is to normalize the data: convert each value into a z-score, the number of standard deviations above or below the mean of its distribution. The resulting list of z-scores has a mean of 0 and a standard deviation of 1.
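A minimal sketch of that normalization, assuming you compute one frequency dictionary per n-gram size and normalize each category separately before comparing across categories:

```python
from statistics import mean, stdev

def z_scores(freqs):
    """Map raw frequencies to z-scores: (x - mean) / standard deviation.
    Requires at least two distinct values so stdev is nonzero."""
    values = list(freqs.values())
    mu = mean(values)
    sigma = stdev(values)
    return {k: (v - mu) / sigma for k, v in freqs.items()}

# Frequencies {1, 2, 3} have mean 2 and (sample) stdev 1,
# so the z-scores come out as -1, 0, and 1.
print(z_scores({"a": 1.0, "b": 2.0, "c": 3.0}))
```

After this step, a unigram and a bigram with the same z-score are equally "surprising" within their own categories, which is one reasonable notion of comparability for sizing words in a cloud.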

Does either of those get you the results you need?