How can one dodge the encoding problems when reading Stata-data into R?
The dataset I wish to read is a .dta in either Stata 12 or Stata 13 (before Stata introduced support for utf-8 in version 14). Text-variables with Swedish and German letters å, ä, ö, ß, as well as other characters do not import well.
I have tried these answers, read.dta in foreign, the haven package (with no encoding parameters), and now read.dta13 from readstata13, which informs me that it expects Stata files to be encoded in CP1252. But alas, the encoding doesn't work. Should I give up and use a .csv export as a bridge instead, or is it actually possible to read .dta files in R?
Minimal example:
This code downloads the first few lines of my dataset and illustrates the problem, for example in the variable vocation, which contains text in Scandinavian languages.
setwd("~/Downloads/")
system("curl -O http://www.lilljegren.com/stackoverflow/example.stata13.dta", intern=F)
library(foreign)
library(haven)  # read_dta() comes from haven, not foreign
?read_dta
df1 <- read_dta('example.stata13.dta', encoding="latin1")
df2 <- read_dta('example.stata13.dta', encoding="CP1252")
library(readstata13)
df3 <- read.dta13('example.stata13.dta', fromEncoding="latin1")
df4 <- read.dta13('example.stata13.dta', fromEncoding="CP1252")
df5 <- read.dta13('example.stata13.dta', fromEncoding="utf-8")
vocation <- c("Brandkorpral","Sömmerska","Jungfru","Timmerman","Skomakare","Skräddare","Föreståndare","Platsförsäljare","Sömmerska")
df4$vocation == vocation
# [1] TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
csv is probably the best thing to do. Or if you have Stata 14, convert the files to Unicode first and save. – user8682794
enca, but it is not able to guess what encoding they are, and I also have some encoding problems reading the csv-files that Stata generates. Uhhh. Stata really isn't awesome :/ 21st century software without support for utf-8 :( – nJGL
"macroman", and I found out by going through the csv-solution, as you suggested, so thank you. – nJGL
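For anyone hitting the same symptom, here is a minimal sketch of how one might test candidate source encodings against a known string, using only base R. The bytes below are an assumption for illustration (how "Sömmerska" would be stored in Mac OS Roman, where ö is the single byte 0x9A), not data taken from the actual file, and "macintosh" is assumed to be the name the local iconv uses for that encoding (some platforms also accept "macroman").

```r
# Bytes assumed for illustration: "Sömmerska" as stored in Mac OS Roman,
# where ö is the single byte 0x9A.
stored <- rawToChar(as.raw(c(0x53, 0x9A, 0x6D, 0x6D, 0x65, 0x72, 0x73, 0x6B, 0x61)))
expected <- "S\u00f6mmerska"

# Candidate source encodings to try. "macintosh" is an assumed iconv name
# for Mac OS Roman; check iconvlist() on your own system.
candidates <- c("latin1", "CP1252", "macintosh")

# Decode the same bytes under each candidate and convert to UTF-8.
decoded <- vapply(
  candidates,
  function(enc) iconv(stored, from = enc, to = "UTF-8"),
  character(1)
)

# TRUE only for a candidate that reproduces the expected text.
matches <- decoded == expected
print(matches)
```

If the same pattern holds for the real file, the analogous call would presumably be read.dta13('example.stata13.dta', fromEncoding="macroman"), in line with the attempts in the question.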