
I'm doing some text mining involving Portuguese text. Some of my custom text-mining functions also contain other special characters.

I'm no expert on this topic. When a lot of my characters started displaying incorrectly, I assumed I needed to change the file encoding. I tried:

  • ISO-8859-1
  • ISO-8859-7
  • UTF-8
  • WINDOWS-1252

None of them improved the display of characters. Do I need a different encoding or am I going about this all wrong?

For example, when I try to read this list of stopwords from GitHub:

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt") 

They come out like this:

tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

I've also tried it with stringsAsFactors = F.

I don't speak Portuguese, but my instinct tells me that the Euro and copyright symbols are not part of the Portuguese alphabet. It also seems to be turning some accented lowercase e's into differently accented uppercase A's.

In case it's helpful:

Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I also tried changing the locale, as well as stri_encode(stop_words$V1, "", "UTF-8") and tail(enc2native(as.vector(stop_words[,1])), 17).
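In case it helps, this is roughly what those attempts looked like (reconstructed from the calls above; the stringi package is assumed to be installed, and none of these fixed the display):

library(stringi)

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         stringsAsFactors = FALSE)

stri_encode(stop_words$V1, "", "UTF-8")            # re-encode from the native encoding to UTF-8
tail(enc2native(as.vector(stop_words[, 1])), 17)   # convert back to the native encoding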

I don't think the problem is with the Portuguese alphabet. When I get the stop_words from GitHub with your code above, I can see the characters properly formatted. How are you changing the file encoding? – Oriol Mirosa
@OriolMirosa I had the problem before changing the encoding from my system default, which is ISO-8859-1. I tried changing it using RStudio (Reopen with Encoding), then re-pulling the data. I also tried changing it with the stringi package. I think the answer below is correct that it's being double-encoded somehow, but I don't know why or how to fix it. – Hack-R
Have you tried enc2utf8(as.vector(stop_words[,1])) or enc2native(as.vector(stop_words[,1]))? – Oriol Mirosa
@OriolMirosa I had not tried that, thanks. I just tried it after reading your comment, but the problem is still there. – Hack-R
Hmm... What system are you on? Do you use RStudio? What font are you using for your R terminal? Can you see tildes and other Latin characters in your terminal? (If your keyboard is in English, press alt+e and then e to get 'é'.) – Oriol Mirosa

2 Answers


I am Portuguese and I had the same problem, even though my locale is

Sys.getlocale()
[1] "LC_COLLATE=Portuguese_Portugal.1252;LC_CTYPE=Portuguese_Portugal.1252;LC_MONETARY=Portuguese_Portugal.1252;LC_NUMERIC=C;LC_TIME=Portuguese_Portugal.1252"

So I looked it up online and found this tip on SO:

stop_words2 <- sapply(stop_words, as.character)

It worked. Note that I read the data in using read.table(..., stringsAsFactors = FALSE).
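Putting it together, a minimal sketch of what worked for me (using the URL from the question):

# Read the stopwords without converting strings to factors,
# then coerce each column to character, as in the tip above.
stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         stringsAsFactors = FALSE)
stop_words2 <- sapply(stop_words, as.character)
tail(stop_words2, 17)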


You seem to be double-encoding to UTF-8.

Here is a chart of how UTF-8 characters get mangled this way: http://www.i18nqa.com/debug/utf8-debug.html. Now look at the "Actual" column.

As you can see, the characters printed match the "Actual" column (the raw UTF-8 bytes displayed as Windows-1252) rather than the characters that were intended.
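You can reproduce one layer of this in R itself; here is a minimal sketch of the mechanism (not your exact pipeline):

x <- enc2utf8("é")       # the UTF-8 bytes of é
charToRaw(x)             # c3 a9
y <- x
Encoding(y) <- "latin1"  # declare those same bytes to be Latin-1/Windows-1252
y                        # prints "Ã©" -- the "Actual" column of the chart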

A temporary fix would be to decode one layer of the UTF-8 mangling.
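In R, assuming the text picked up exactly one spurious decode as Windows-1252 (which is what your locale suggests), one way to peel that layer off is to push the mangled characters back to their raw bytes and then declare those bytes as UTF-8. A sketch:

fix_one_layer <- function(x) {
  x <- iconv(x, from = "", to = "Windows-1252")  # "Ã©" back to the bytes c3 a9
  Encoding(x) <- "UTF-8"                         # declare the bytes as the UTF-8 they really are
  x                                              # c3 a9 now displays as "é"
}

fix_one_layer(as.character(stop_words$V1))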

Update:

After installing R, I tried to reproduce the problem.
Here is my console log with a simple explanation:

First, I copy-pasted your code:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

OK, so it didn't work as is, so I added the encoding parameter to the read.table call. Here is the result when I tried lower-case "utf-8":

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",encoding="utf-8")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

Finally, I used "UTF-8" in capital letters, and now it works properly:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt", encoding = "UTF-8")
> tail(stop_words, 17)
            V1
206  tivéramos
207      tenha
208   tenhamos
209     tenham
210    tivesse
211 tivéssemos
212   tivessem
213      tiver
214   tivermos
215    tiverem
216      terei
217       terá
218    teremos
219      terão
220      teria
221   teríamos
222     teriam

You might have forgotten to pass the encoding parameter to read.table, or passed it in lower case instead of upper case. What I take from this is that R interprets the bytes in your native locale's encoding unless you declare that the file is already encoded in UTF-8.
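As a sanity check (a small sketch using the same URL), the accented strings should now carry a UTF-8 mark; pure-ASCII entries reporting "unknown" is normal:

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         encoding = "UTF-8", stringsAsFactors = FALSE)
unique(Encoding(stop_words$V1))  # accented entries report "UTF-8", plain ones "unknown"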