I created a corpus in R using the tm package, specifying language and encoding as follows:

de_DE.corpus <- Corpus(VectorSource(de_DE.sample),
                       readerControl = list(language = "de_DE", encoding = "UTF_8"))
de_DE.corpus[36]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus, control = list(encoding = "UTF-8"))
inspect(de_DE.dtm[, grepl("grÃ", de_DE.dtm$dimnames$Terms)])
inspect(de_DE.dtm[36, ])

When I look at document 36, which contains 'ü', via de_DE.corpus[36]$content, the text is displayed correctly, e.g. " ...Single ist so die Begründung der Behörde Eine... "

But when I create the DocumentTermMatrix (I tried several options for encoding and language), I get terms like "begrÃ" where the word should be "Begründung". See the result of inspect(de_DE.dtm[36, ]):

<<DocumentTermMatrix (documents: 1, terms: 21744)>>
Non-/sparse entries: 102/21642
Sparsity           : 100%
Maximal term length: 43
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begrà das dem der die eine einen jobcenter und zum
  36     3   4   2   4   8    2     2         4   3   3

I would appreciate it if someone could tell me how to fix this problem. Thanks in advance :)

Which operating system are you on? - knb
Windows 10, R version 3.4.1, package ‘tm’ version 0.7-1 - Sandra Meneses
I don't know what's going on, but here's a potential clue: text <- "Begründung"; Encoding(text) ## [1] "UTF-8". Here's what happens if we set the wrong encoding: Encoding(text) <- "latin1"; print(text) ## [1] "BegrÃ¼ndung" - Patrick Perry
After many failed attempts, the only solution I found was:
de_DE.corpus <- Corpus(VectorSource(de_DE.sample), readerControl = list(language="de_DE",encoding = "UTF_8"))
de_DE.corpus <- tm_map(de_DE.corpus, function(x) iconv(x, from='UTF-8', to="latin1"))
de_DE.corpus[4]$content
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,control = list (encoding = 'UTF-8'))
inspect(de_DE.dtm[4, ])
Hope it helps someone having the same issue. - Sandra Meneses

1 Answer

Can you check your input data? Your code works for me, so I think the problem already occurs when you load the data into de_DE.sample.

doc<-c("Single ist so die Begründung der Behörde Eine", "Single Begründung Behörde ")

de_DE.corpus <- Corpus(VectorSource(doc), readerControl
                       = list(language="de_DE",encoding = "UTF_8"))
de_DE.dtm <- DocumentTermMatrix(de_DE.corpus,control = list
                                (encoding = 'UTF-8'))

inspect(de_DE.dtm[1, ])
<<DocumentTermMatrix (documents: 1, terms: 7)>>
Non-/sparse entries: 7/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs begründung behörde der die eine ist single
   1          1       1   1   1    1   1      1
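
One possible way to check the input data, sketched with base R only. It assumes de_DE.sample is a character vector; sample_text below is a hypothetical stand-in for it, with one element deliberately mislabelled as latin1 to mimic what may have happened on import. This illustrates the check and is not a confirmed diagnosis of the original data.

```r
# Hypothetical stand-in for de_DE.sample (assumption: a character vector).
good <- "Begr\u00fcndung der Beh\u00f6rde"   # \u escapes yield a UTF-8 string
bad <- good
Encoding(bad) <- "latin1"                    # same bytes, wrong declared encoding
sample_text <- c(good, bad)

# 1. Which encoding does R believe each element has?
declared <- Encoding(sample_text)

# 2. Are the raw bytes valid UTF-8, whatever the declared encoding says?
bytes_ok <- validUTF8(sample_text)

# 3. Elements whose bytes are valid UTF-8 but carry another label
#    are re-declared as UTF-8 (the bytes themselves are not changed).
mislabelled <- bytes_ok & declared != "UTF-8"
fixed <- sample_text
Encoding(fixed[mislabelled]) <- "UTF-8"

print(declared)
print(Encoding(fixed))
```

If validUTF8() returns FALSE for some elements, the file was probably not UTF-8 to begin with, and re-reading it with an explicit encoding (for example the encoding argument of readLines()) may be a better fix than relabelling.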