I'm doing some text mining involving Portuguese text, and some of my custom text-mining functions also contain other special characters. I'm no expert on this topic. When many of my characters started displaying incorrectly, I assumed I needed to change the file encoding, so I tried:
- ISO-8859-1
- ISO-8859-7
- UTF-8
- WINDOWS-1252
None of them improved the display of characters. Do I need a different encoding or am I going about this all wrong?
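In case the details matter, this is roughly how I was applying each candidate encoding (a reconstruction; my actual calls may have differed slightly):

# Hypothetical reconstruction of how I tried each encoding
encodings <- c("ISO-8859-1", "ISO-8859-7", "UTF-8", "WINDOWS-1252")
for (enc in encodings) {
  sw <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt",
                   fileEncoding = enc, stringsAsFactors = FALSE)
  print(tail(sw$V1, 3))  # still garbled under every one of these
}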
For example, when I try to read this list of stopwords from GitHub:
stop_words <- read.table("https://gist.githubusercontent.com/alopes/5358189/raw/2107d809cca6b83ce3d8e04dbd9463283025284f/stopwords.txt")
They come out like this:
tail(stop_words, 17)
206  tivéramos
207       tenha
208    tenhamos
209      tenham
210     tivesse
211  tivéssemos
212    tivessem
213       tiver
214    tivermos
215     tiverem
216       terei
217       terá
218     teremos
219      terão
220       teria
221    terÃamos
222      teriam
I've also tried it with stringsAsFactors = FALSE.
I don't speak Portuguese, but my instinct tells me that the Euro and copyright symbols are not part of its alphabet. It also seems to be turning some accented lowercase e's into differently accented uppercase A's.
In case it's helpful:
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Beyond that, I tried stri_encode(stop_words$V1, "", "UTF-8") from the stringi package, tail(enc2native(as.vector(stop_words[,1])), 17), and a commenter's suggestion of enc2utf8(as.vector(stop_words[,1])). I suspect the text is being double-encoded somehow, but I don't know why or how to fix it.
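To illustrate my double-encoding suspicion, here is a minimal sketch (my own reconstruction, not a confirmed diagnosis): if I take a word I expected and mis-declare its UTF-8 bytes as Latin-1, I get exactly the kind of garbage shown in my output above.

# Sketch: declare a word's UTF-8 bytes as Latin-1, then re-encode
x <- "teríamos"                            # the word I expected at row 221
bad <- rawToChar(charToRaw(enc2utf8(x)))   # x's UTF-8 bytes, byte-for-byte
Encoding(bad) <- "latin1"                  # wrongly declared as Latin-1
enc2utf8(bad)                              # "í" becomes "Ã" plus an invisible
                                           # soft hyphen, printing as "terÃamos"
# The same trick turns "é" into "Ã©", which would explain the copyright
# symbols I'm seeing in rows like 206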