I am using the tm package from CRAN in R and have a problem creating a TermDocumentMatrix from a corpus. When I build the TermDocumentMatrix from a UTF-8 corpus, some of the terms are turned into Unicode escape sequences.
library(tm)
corpus <- Corpus(VectorSource(vector_with_texts_in_several_languages, encoding = "UTF-8"))
tdm <- TermDocumentMatrix(corpus, control = list(weighting = weightTfIdf))
print(Terms(tdm)[1:3])
Returns:
[1] "<U+03BB>a<U+03B3><U+03AF>a"
[2] "<U+03C1><U+03AE>fa<U+03BD><U+03BF><U+03C2>"
[3] "<U+03C1><U+03AF>p<U+03BF><U+03C5>"
If I inspect the corpus manually, the text is displayed correctly.
print(corpus[[1]])
Returns:
квартира на кутузовском
Does anyone know how I can get a TermDocumentMatrix with correct terms? Or is there a way to convert these Unicode escape sequences back into readable output?
Note: the output of print(Terms(tdm)) does NOT contain the words shown by print(corpus[[1]]).