Force create Terms using tm package

Question

I have a corpus that contains words such as 5k, 50k, 7.5k, 75k, 10K, and 100K. When I create a TDM (term-document matrix) using the tm package, terms such as 10k and 100k are extracted as separate terms, but 5k and 7.5k are not. I understand that after punctuation removal "7.5k" may be counted under the "75k" term, but what is going on with "5k"? Why is it not extracted as a term?

Basically, I would like to know whether there is a way to force the tm package to look for specific words and extract them as key terms.

Any pointers would help!

JWLM · Accepted Answer · 2017-01-20T15:38:14

Are you breaking words at punctuation? That is, is '.' a word-break character? If so, then the split of '7.5k' is ('7', '5k'), the second of which matches '5k'.
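Beyond word-break characters, another setting may be worth checking: tm's `wordLengths` control option defaults to `c(3, Inf)`, so two-character tokens such as "5k" are dropped unless the minimum is lowered. The sketch below is a minimal illustration, not a confirmed diagnosis of the asker's corpus. The sample documents and the whitespace-only tokenizer are assumptions made for the example; `dictionary`, `wordLengths`, `removePunctuation`, `tolower`, and `tokenize` are standard `termFreq` control options.

```r
library(tm)

# Hypothetical sample documents, invented for illustration only
docs <- c("prizes of 5k and 7.5k",
          "salary of 10K or 100K",
          "between 50k and 75k")
corpus <- VCorpus(VectorSource(docs))

# Assumed tokenizer: split on whitespace only, so "7.5k" stays one token
ws_tokenizer <- function(x) unlist(strsplit(as.character(x), "[[:space:]]+"))

tdm <- TermDocumentMatrix(corpus, control = list(
  tokenize          = ws_tokenizer,
  tolower           = TRUE,         # "10K" and "10k" become the same term
  removePunctuation = FALSE,        # keep the "." inside "7.5k"
  wordLengths       = c(1, Inf),    # default c(3, Inf) would discard "5k"
  dictionary        = c("5k", "7.5k", "10k", "50k", "75k", "100k")
))

inspect(tdm)
```

The `dictionary` entry restricts the matrix to the listed terms, which is the closest thing to forcing tm to look for specific words. If punctuation has already been stripped with `tm_map(corpus, removePunctuation)` before this step, "7.5k" will have become "75k" regardless of these settings.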
