3
votes

How can one dodge the encoding problems when reading Stata-data into R?

The dataset I wish to read is a .dta in either Stata 12 or Stata 13 (before Stata introduced support for utf-8 in version 14). Text-variables with Swedish and German letters å, ä, ö, ß, as well as other characters do not import well.

I have tried these answers, read.dta in foreign, the haven package (with no encoding parameters), and now readstata13, which informs me that it expects Stata files to be encoded in CP1252. But alas, the encoding doesn't work. Should I give up and use a .csv export as a bridge instead, or is it actually possible to read .dta files in R?

Minimal example:
This code downloads the first few lines of my dataset and illustrates the problem, for example in the variable vocation, which contains Swedish text.

setwd("~/Downloads/")
system("curl -O http://www.lilljegren.com/stackoverflow/example.stata13.dta", intern = FALSE)

library(haven)  # read_dta() comes from haven, not foreign
?read_dta
df1 <- read_dta('example.stata13.dta', encoding="latin1")
df2 <- read_dta('example.stata13.dta', encoding="CP1252")
library(readstata13)
df3 <- read.dta13('example.stata13.dta', fromEncoding="latin1")
df4 <- read.dta13('example.stata13.dta', fromEncoding="CP1252")
df5 <- read.dta13('example.stata13.dta', fromEncoding="utf-8")

vocation <- c("Brandkorpral","Sömmerska","Jungfru","Timmerman","Skomakare","Skräddare","Föreståndare","Platsförsäljare","Sömmerska")
df4$vocation == vocation
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
csv is probably the best thing to do. Or if you have Stata 14, convert the files to Unicode first and save. - user8682794
This is what I'm fearing. I'm looking at different files Stata builds using enca, but it is not able to guess what encoding they are, and I also have some encoding problems reading the csv files that Stata generates. Uhhh. Stata really isn't awesome :/ 21st century software without support for utf-8 :( - nJGL
Stata's current version is 15, and as of version 14 it supports Unicode. Not sure why you are complaining about features that are not available in software that is two versions behind and no longer supported / maintained. Upgrade? - user8682794
I am poor, and Stata is licensed software; an upgrade would be expensive just to resolve this encoding problem that, I think one could argue, shouldn't have to belong to our decade. But duly noted: I was grumpy. :) Besides, the correct encoding was "macroman", and I found out by going through the csv solution, as you suggested, so thank you. - nJGL

1 Answer

4
votes

The correct encoding for reading files generated by Stata prior to version 14 on Macs is "macroman":

df <- read.dta13('example.stata13.dta', fromEncoding="macroman")

On my Mac, both .dta-files in stata13 and stata12 formats (saved by saveold in Stata 13) imported nicely like this.

The readstata13 manual is presumably right to assume "CP1252" on other platforms. For me, however, "macroman" did the trick (also for the .csv files that Stata 13 generated with export delimited).
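
To see why "CP1252" garbles the Swedish letters while "macroman" does not, here is a minimal sketch in base R that needs neither the file nor any package. It assumes that the bytes below are how a pre-14 Stata on a Mac would store "Sömmerska" (in Mac OS Roman, "ö" is the single byte 0x9A), and that this platform's iconv knows the alias "macintosh" for Mac OS Roman (check iconvlist() if it does not). The names stored, expected and candidates are only for illustration.

```r
# Bytes of "Sömmerska" in Mac OS Roman: "ö" is stored as the single byte 0x9A.
stored <- rawToChar(as.raw(c(0x53, 0x9a, 0x6d, 0x6d, 0x65, 0x72, 0x73, 0x6b, 0x61)))

# The value we know the string should have, written with a Unicode escape
# so the comparison does not depend on the encoding of this script.
expected <- "S\u00f6mmerska"

# Candidate source encodings. "macintosh" is assumed to be accepted by the
# local iconv as a name for Mac OS Roman.
candidates <- c("latin1", "CP1252", "macintosh")

# Decode the same bytes under each candidate encoding.
decoded <- vapply(
  candidates,
  function(enc) iconv(stored, from = enc, to = "UTF-8"),
  character(1)
)

# TRUE only where decoding reproduces the known value.
matches <- !is.na(decoded) & decoded == expected
print(matches)
```

Only the Mac OS Roman candidate should match here, because byte 0x9A is "š" in CP1252 and a control character in latin1. The same comparison against a few known values (like the vocation vector in the question) is a quick way to test a fromEncoding guess before reading the whole file.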