Unicode symbols when creating DocumentTermMatrix

Question

I am using the tm package from CRAN in R. I have a problem creating a DocumentTermMatrix from a Corpus. When I create a TermDocumentMatrix from a UTF-8 corpus, some words turn into Unicode escape symbols.

corpus <- Corpus(VectorSource(vector_with_texts_in_several_languages, encoding = "UTF-8"))
tdm <- TermDocumentMatrix(corpus, control=list(weighting=weightTfIdf))
print(Terms(tdm)[1:3])

Returns:

[1] "<U+03BB>a<U+03B3><U+03AF>a"
[2] "<U+03C1><U+03AE>fa<U+03BD><U+03BF><U+03C2>" 
[3] "<U+03C1><U+03AF>p<U+03BF><U+03C5>"

If I manually inspect the corpus then I see correct output.

print(corpus[[1]])

Returns:

квартира на кутузовском

Does anyone know how I can get a TermDocumentMatrix with correct terms? Or is there a way to convert these Unicode symbols back into readable output?

Note: print(Terms(tdm)) does NOT contain words from print(corpus[[1]])

Frank Wang · Accepted Answer · 2013-08-28T08:30:50

I suspect the encoding works for the first step. You can check this by inspecting the first element of the corpus:

 corpus[[1]]
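If the corpus itself looks fine and only the terms come back as `<U+XXXX>` escapes, one possible workaround is to decode the escapes after the matrix is built. The following is a minimal sketch in base R, not a fix inside tm. The helper name `decode_u_escapes` is hypothetical, and it assumes the terms contain literal escape text such as `<U+03BB>`. It can only reverse those escapes: the Latin letters mixed into the terms above suggest that some characters were already substituted, and that cannot be undone this way.

```r
# Hypothetical helper (not part of tm): replace literal "<U+XXXX>" escapes
# with the characters they stand for.
decode_u_escapes <- function(x) {
  pattern <- "<U\\+([0-9A-Fa-f]{4,6})>"
  # Collect every distinct escape that occurs anywhere in x.
  escapes <- unique(unlist(regmatches(x, gregexpr(pattern, x))))
  for (esc in escapes) {
    # Pull out the hex digits and convert them to a code point.
    code <- strtoi(sub(pattern, "\\1", esc), 16L)
    # Substitute the escape text with the corresponding UTF-8 character.
    x <- gsub(esc, intToUtf8(code), x, fixed = TRUE)
  }
  x
}

# Example input copied from the output shown in the question.
terms <- c("<U+03BB>a<U+03B3><U+03AF>a", "<U+03C1><U+03AF>p<U+03BF><U+03C5>")
decoded <- decode_u_escapes(terms)
```

Whether `decoded` then prints as readable text still depends on the console locale. On a non-UTF-8 locale, R may show the same escapes again even though the underlying strings are correct.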
