
I'm doing some text mining involving Portuguese text. Some of my custom text-mining functions also contain other special characters.

I'm no expert on this topic. When a lot of my characters started displaying incorrectly, I assumed I needed to change the file encoding. I tried:

  • ISO-8859-1
  • ISO-8859-7
  • UTF-8
  • WINDOWS-1252

None of them improved the display of characters. Do I need a different encoding or am I going about this all wrong?

For example, when I try to read this list of stopwords from GitHub:

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt") 

They come out like this:

tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

I've also tried it with stringsAsFactors = F.

I don't speak Portuguese, but my instinct tells me that the Euro and copyright symbols are not part of the Portuguese alphabet. It also seems to be turning some accented lowercase e's into differently accented uppercase A's.

In case it's helpful:

Sys.getlocale()

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I also tried changing the locale, as well as stri_encode(stop_words$V1, "", "UTF-8") and tail(enc2native(as.vector(stop_words[,1])), 17).
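In case it helps, this is roughly what those attempts looked like (reconstructed from the calls above; the stringi package is assumed to be installed, and none of these fixed the display):

library(stringi)

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         stringsAsFactors = FALSE)

stri_encode(stop_words$V1, "", "UTF-8")            # re-encode from the native encoding to UTF-8
tail(enc2native(as.vector(stop_words[, 1])), 17)   # convert back to the native encoding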

I don't think the problem is with the Portuguese alphabet. When I get the stop_words from GitHub with your code above, I can see the characters properly formatted. How are you changing the file encoding? – Oriol Mirosa
@OriolMirosa I had the problem before changing the encoding from my system default, which is ISO-8859-1. I tried changing it using RStudio (Reopen with Encoding), then re-pulling the data. I also tried changing it with the stringi package. I think the answer below is correct that it's being double-encoded somehow, but I don't know why or how to fix it. – Hack-R
Have you tried enc2utf8(as.vector(stop_words[,1])) or enc2native(as.vector(stop_words[,1]))? – Oriol Mirosa
@OriolMirosa I had not tried that, thanks. I just tried it after reading your comment, but the problem is still there. – Hack-R
Hmm... What system are you on? Do you use RStudio? What font are you using for your R terminal? Can you see tildes and other Latin characters in your terminal? (If your keyboard is in English, press alt+e and then e to get 'é'.) – Oriol Mirosa

2 Answers


I am Portuguese and I had the same problem, even though my locale is

Sys.getlocale()
[1] "LC_COLLATE=Portuguese_Portugal.1252;LC_CTYPE=Portuguese_Portugal.1252;LC_MONETARY=Portuguese_Portugal.1252;LC_NUMERIC=C;LC_TIME=Portuguese_Portugal.1252"

So I looked it up online and found this tip on SO:

stop_words2 <- sapply(stop_words, as.character)

It worked. Note that I read the data in using read.table(..., stringsAsFactors = FALSE).
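Putting it together, a minimal sketch of what worked for me (using the URL from the question):

# Read the stopwords without converting strings to factors,
# then coerce each column to character, as in the tip above.
stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         stringsAsFactors = FALSE)
stop_words2 <- sapply(stop_words, as.character)
tail(stop_words2, 17)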


You seem to be double-encoding to UTF-8.

Here is a chart of how UTF-8 characters get mangled this way: http://www.i18nqa.com/debug/utf8-debug.html. Now look at the "Actual" column.

As you can see, the characters printed match the "Actual" column (the raw UTF-8 bytes displayed as Windows-1252) rather than the characters that were intended.
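You can reproduce one layer of this in R itself; here is a minimal sketch of the mechanism (not your exact pipeline):

x <- enc2utf8("é")       # the UTF-8 bytes of é
charToRaw(x)             # c3 a9
y <- x
Encoding(y) <- "latin1"  # declare those same bytes to be Latin-1/Windows-1252
y                        # prints "Ã©" -- the "Actual" column of the chart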

A temporary fix would be to decode one layer of the UTF-8 mangling.
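In R, assuming the text picked up exactly one spurious decode as Windows-1252 (which is what your locale suggests), one way to peel that layer off is to push the mangled characters back to their raw bytes and then declare those bytes as UTF-8. A sketch:

fix_one_layer <- function(x) {
  x <- iconv(x, from = "", to = "Windows-1252")  # "Ã©" back to the bytes c3 a9
  Encoding(x) <- "UTF-8"                         # declare the bytes as the UTF-8 they really are
  x                                              # c3 a9 now displays as "é"
}

fix_one_layer(as.character(stop_words$V1))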

Update:

After installing R, I tried to reproduce the problem.
Here is my console log with a simple explanation:

First, I copy-pasted your code:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

OK, so it didn't work as is, so I added the encoding parameter to the read.table call. Here is the result when I tried lower-case "utf-8":

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",encoding="utf-8")
> tail(stop_words, 17)
             V1
206  tivÃ©ramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211 tivÃ©ssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terÃ¡
218     teremos
219      terÃ£o
220       teria
221   terÃ­amos
222      teriam

Finally, I used "UTF-8" in capital letters, and now it works properly:

> stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt", encoding = "UTF-8")
> tail(stop_words, 17)
            V1
206  tivéramos
207      tenha
208   tenhamos
209     tenham
210    tivesse
211 tivéssemos
212   tivessem
213      tiver
214   tivermos
215    tiverem
216      terei
217       terá
218    teremos
219      terão
220      teria
221   teríamos
222     teriam

You might have forgotten to pass the encoding parameter to read.table, or passed it in lower case instead of upper case. What I take from this is that R interprets the bytes in your native locale's encoding unless you declare that the file is already encoded in UTF-8.
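As a sanity check (a small sketch using the same URL), the accented strings should now carry a UTF-8 mark; pure-ASCII entries reporting "unknown" is normal:

stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                         encoding = "UTF-8", stringsAsFactors = FALSE)
unique(Encoding(stop_words$V1))  # accented entries report "UTF-8", plain ones "unknown"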