1
votes

I am trying to read a Stata dataset in R with the foreign package, but when I try to read the file using:

library(foreign)
data <- read.dta("data.dta")

I got the following error:

Error in read.dta("data.dta") : a binary read error occurred

The file works fine in Stata. This site suggests saving the file in Stata without labels and then reading it into R. With this workaround I am able to load the file into R, but then I lose the labels. Why am I getting this error, and how can I read the file into R with the labels? Another person finds that they get this error when they have variables with no values. My data do have at least one or two such variables, but I have no easy way to identify them in Stata. It is a very large file with thousands of variables.

4
There are several ways to check for missing values in Stata even if you have a large number of variables. See here. - Metrics
The version of Stata used to make the file could be the problem. Read the help page for read.dta carefully and then do whatever is needed to save the file in a version it supports. - IRTFM

4 Answers

2
votes

You should call library(foreign) before reading the Stata data.

library(foreign)
data <- read.dta("data.dta")

Updates: As mentioned here,

"The error message implies that the file was found, and that it started with the right sequence of bytes to be a Stata .dta file, but that something (probably the end of the file) prevented R from reading what it was expecting to read. "

But, we might be just guessing without any further information.

Update to OP's question and answer:

I tested whether that is the case using the auto data from Stata, but it is not. So there must be other reasons:

Claims 1 and 2: if a variable contains missing values, or the dataset has labels, R's read.dta will generate the error.

sysuse auto      // this dataset has labels
replace mpg = .  // sets mpg to missing in every observation
br in 1/10
make           price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
AMC Concord    4099   .    3      2.5       11     2930    186     40    121           3.58        Domestic
AMC Pacer      4749   .    3      3.0       11     3350    173     40    258           2.53        Domestic
AMC Spirit     3799   .    .      3.0       12     2640    168     35    121           3.08        Domestic
Buick Century  4816   .    3      4.5       16     3250    196     40    196           2.93        Domestic
Buick Electra  7827   .    4      4.0       20     4080    222     43    350           2.41        Domestic
Buick LeSabre  5788   .    3      4.0       21     3670    218     43    231           2.73        Domestic
Buick Opel     4453   .    .      3.0       10     2230    170     34    304           2.87        Domestic
Buick Regal    5189   .    3      2.0       16     3280    200     42    196           2.93        Domestic
Buick Riviera  10372  .    3      3.5       17     3880    207     43    231           2.93        Domestic
Buick Skylark  4082   .    3      3.5       13     3400    200     42    231           3.08        Domestic

save "~\myauto"
describe

Contains data from ~\myauto.dta
  obs:            74                          1978 Automobile Data
 vars:            12                          25 Aug 2013 11:32
 size:         3,478 (99.9% of memory free)   (_dta has notes)
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
make            str18  %-18s                  Make and Model
price           int    %8.0gc                 Price
mpg             int    %8.0g                  Mileage (mpg)
rep78           int    %8.0g                  Repair Record 1978
headroom        float  %6.1f                  Headroom (in.)
trunk           int    %8.0g                  Trunk space (cu. ft.)
weight          int    %8.0gc                 Weight (lbs.)
length          int    %8.0g                  Length (in.)
turn            int    %8.0g                  Turn Circle (ft.)
displacement    int    %8.0g                  Displacement (cu. in.)
gear_ratio      float  %6.2f                  Gear Ratio
foreign         byte   %8.0g       origin     Car type
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  foreign


library(foreign)
myauto <- read.dta("myauto.dta")  # works perfectly
str(myauto)
'data.frame':   74 obs. of  12 variables:
 $ make        : chr  "AMC Concord" "AMC Pacer" "AMC Spirit" "Buick Century" ...
 $ price       : int  4099 4749 3799 4816 7827 5788 4453 5189 10372 4082 ...
 $ mpg         : int  NA NA NA NA NA NA NA NA NA NA ...
 $ rep78       : int  3 3 NA 3 4 3 NA 3 3 3 ...
 $ headroom    : num  2.5 3 3 4.5 4 4 3 2 3.5 3.5 ...
 $ trunk       : int  11 11 12 16 20 21 10 16 17 13 ...
 $ weight      : int  2930 3350 2640 3250 4080 3670 2230 3280 3880 3400 ...
 $ length      : int  186 173 168 196 222 218 170 200 207 200 ...
 $ turn        : int  40 40 35 40 43 43 34 42 43 42 ...
 $ displacement: int  121 258 121 196 350 231 304 196 231 231 ...
 $ gear_ratio  : num  3.58 2.53 3.08 2.93 2.41 ...
 $ foreign     : Factor w/ 2 levels "Domestic","Foreign": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "datalabel")= chr "1978 Automobile Data"
 - attr(*, "time.stamp")= chr "25 Aug 2013 11:23"
 - attr(*, "formats")= chr  "%-18s" "%8.0gc" "%8.0g" "%8.0g" ...
 - attr(*, "types")= int  18 252 252 252 254 252 252 252 252 252 ...
 - attr(*, "val.labels")= chr  "" "" "" "" ...
 - attr(*, "var.labels")= chr  "Make and Model" "Price" "Mileage (mpg)" "Repair Record 1978" ...
 - attr(*, "expansion.fields")=List of 2
  ..$ : chr  "_dta" "note1" "from Consumer Reports with permission"
  ..$ : chr  "_dta" "note0" "1"
 - attr(*, "version")= int 12
 - attr(*, "label.table")=List of 1
  ..$ origin: Named int  0 1
  .. ..- attr(*, "names")= chr  "Domestic" "Foreign"
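
On the question of keeping labels: the label.table attribute in the str() output above stores each set of value labels as a named integer vector (names are the labels, values are the codes). The following is a minimal sketch, assuming you have the numeric codes (for example from a copy saved without labels) and can get the label table separately. apply_value_labels and the toy vectors are hypothetical, not part of foreign.

```r
# Hypothetical helper: turn integer codes back into a labelled factor.
# `label_table` is a named integer vector shaped like
# attr(myauto, "label.table")$origin, where the names are the labels.
apply_value_labels <- function(codes, label_table) {
  factor(codes, levels = unname(label_table), labels = names(label_table))
}

# Stand-in data mimicking the structure shown in the str() output.
origin <- c(Domestic = 0L, Foreign = 1L)
codes <- c(0L, 0L, 1L, NA, 1L)

foreign_labelled <- apply_value_labels(codes, origin)
print(foreign_labelled)
```

Codes that are missing stay NA, and codes without an entry in the label table would also become NA, so it is worth comparing table(codes) with table(foreign_labelled) afterwards.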
2
votes

Here's a troubleshooting list. My guess is that the first item has about a 75% chance of solving your issue.

  1. In Stata, resave a fresh copy of your dta file with saveold, and try again.
  2. If that fails, provide a sample to show what kind of values kill the read.dta function.
  3. If missing values are to blame, run the loop from the other answer.

A more thorough description of the dataset would be required to work past that point. The issue seems fixable; I've never had much trouble using foreign with tons of Stata files.

You might also try the Stata.file function in the memisc package to see if that fails too.
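
Trying one reader after another can be wrapped in a small helper. This is only a sketch: read_with_fallback is a hypothetical name, and the memisc line assumes that the importer returned by Stata.file can be converted with as.data.set and then as.data.frame, so check the memisc help pages before relying on it.

```r
# Hypothetical helper: try each reader in turn, return the first success.
read_with_fallback <- function(path, readers) {
  for (name in names(readers)) {
    result <- tryCatch(
      readers[[name]](path),
      error = function(e) {
        # Report which reader failed and why, then move on to the next one.
        message(name, " failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) return(result)
  }
  stop("No reader could load ", path)
}

readers <- list(
  foreign = function(p) foreign::read.dta(p),
  # Assumption: memisc is installed and this conversion chain applies.
  memisc = function(p) as.data.frame(memisc::as.data.set(memisc::Stata.file(p)))
)

# Uncomment once data.dta is in the working directory:
# data <- read_with_fallback("data.dta", readers)
```

The messages printed for each failed reader are worth keeping, since they show whether memisc hits the same problem as read.dta.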

0
votes

I do not know why this occurs and would be interested if anyone could explain, but read.dta indeed cannot handle variables that are all NA. A solution is to delete such variables in Stata with the following code:

foreach varname of varlist * {
 quietly sum `varname'
 if `r(N)'==0 {
  drop `varname'
  disp "dropped `varname' for too much missing data"
 }
}
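
If the copy saved without labels already loads in R (the workaround from the question), the all-missing variables can also be listed on the R side and then dropped in Stata before resaving with labels. A minimal sketch, with all_missing_vars as a hypothetical helper and a toy data frame standing in for the real file; it assumes empty strings should count as missing, as they do in Stata.

```r
# Hypothetical helper: names of columns that contain no observed values.
all_missing_vars <- function(df) {
  is_empty <- vapply(df, function(col) {
    if (is.character(col)) {
      # Treat empty strings like Stata's missing string value.
      all(is.na(col) | col == "")
    } else {
      all(is.na(col))
    }
  }, logical(1))
  names(df)[is_empty]
}

# Toy stand-in for the data frame loaded from the label-free copy.
df <- data.frame(
  price = c(4099, 4749),
  mpg = c(NA_integer_, NA_integer_),
  note = c("", ""),
  stringsAsFactors = FALSE
)

empty_vars <- all_missing_vars(df)
# Print a drop command that can be pasted into Stata.
cat("drop", empty_vars, "\n")
```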
0
votes

It's been a long time, but I solved this same problem by exporting the .dta data to .csv. The problem was related to the labels of the factor variables, especially because the labels were in Spanish and the ASCII encoding is a mess. I hope this works for someone with the same problem who has access to Stata.

In Stata:

export delimited using "/Users/data.csv", nolabel replace

In R:

df <- read.csv("/Users/data.csv")  # same file that Stata exported above