3
votes

I am quite new to R and more used to Stata.

I managed to read a database from Stata to a R data.frame using library(foreign).

library(foreign)
data <- read.dta("mydata.dta",
     convert.dates = TRUE,
     convert.factors = TRUE,
     missing.type = FALSE,
     convert.underscore = FALSE,
     warn.missing.labels = TRUE)

However, the values (in the sense of the Stata language) are not imported; only the labels are.

Let me explain it a little more. Assume I want to manipulate an education variable called "edu". In Stata language, I use numeric values instead of labels to manipulate my variable and the data editor shows the labels, so long as I have defined my labels. Assume for instance that my variable "edu" takes the values 10 to 40, the following code associates a label to each value:

#delimit ;
label define lib_edu 
10 "Less than high-school degree" 
20 "12th grade or higher, no college degree" 
30 "Undergraduate level (2 to 4 years of college)" 
40 "Graduate level (5 years of college or more)", add;
label values edu lib_edu;
#delimit cr

Then, when I want to manipulate my variable, I need to use the values. For example if I want to drop from my dataset people whose label is less than high-school degree, I simply do:

drop if edu==10

But in my imported R data.frame, the labels are imported as factors. Each label becomes a factor level with an integer code, and those codes do not necessarily correspond to my Stata values, since they restart from 1. Moreover, I cannot use these codes to manipulate my variable. If I want to drop from my dataset people whose label is less than high-school degree, I have to write out the entire label:

data <- data[data$edu!="Less than high-school degree",]

which is not convenient at all, especially when the label is long and complex.

Is it possible to do as in Stata, that is: manipulate numeric values while editing the data.frame with labels, given that my data are exported from Stata?

Thanking you in advance.

1
Yeah, R factors always have integer codes counting up from one. However, once you know the new codes, you should be able to use them like f = factor(c("a","b")); f[ labels(f)[f] != 1 ] (excluding "a", which has a code of 1). Personally, I map the long labels to abbreviations and work with those ("none", "hs", "ug", "g"). - Frank
I get from your answer that 1) it is not possible to re-use the Stata integer codes (at least not when directly importing the data into R); but that 2) it is possible to use the new R codes; and that 3) the least tedious method remains to transform long labels into short ones. Thanks for your help. - Elixterra
Yep. Oh, one more thing, a perk of R: you can have ordered factors and use inequalities, take maxes and mins, etc.: f = factor(c("hs", "hs", "none", "g"), levels=c("none","hs","ug","g"), ordered=TRUE); f[ f >= "ug" ] - Frank
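
The suggestions in these comments can be sketched as follows. This is only an illustration: the vector edu and its contents are made up, and the short labels are the abbreviations proposed above, assumed to be given in the same order as the long labels.

```r
# Hypothetical data: the four education labels from the question.
long_labels <- c(
  "Less than high-school degree",
  "12th grade or higher, no college degree",
  "Undergraduate level (2 to 4 years of college)",
  "Graduate level (5 years of college or more)"
)
edu <- factor(long_labels[c(2, 1, 4, 3, 2)], levels = long_labels)

# R's internal codes count up from 1 in level order, not 10/20/30/40.
as.integer(edu)

# Replace the long labels with short ones (same order as long_labels)
# and mark the factor as ordered so that comparisons work.
edu_short <- factor(edu, levels = long_labels,
                    labels = c("none", "hs", "ug", "g"), ordered = TRUE)

# Drop "less than high school" using the short label ...
kept <- edu_short[edu_short != "none"]

# ... or keep everyone at undergraduate level or above.
ug_or_more <- edu_short[edu_short >= "ug"]
```

With short labels, filtering reads almost like the Stata version, and the ordering makes range conditions possible without knowing the integer codes.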

1 Answer

2
votes

You can approach this problem from two directions: 1. you can drop the value labels from within Stata before you import your data into R, or 2. you can change the data import settings for your data.frame from within R. Which of these two routes will be easier will depend to some degree on what version of Stata you have and the format of your data.

Option 1:

If you want to do this within Stata, I would recommend first reading about and possibly installing the "label utilities" package from SSC: ssc inst labutil. This package contains, among many other very useful tools for manipulating labels, the labdtch or "label detach" command, which will dissociate your value labels from their actual values in the Stata data. Obviously, you would do all this before importing the data into R.

Option 2:

If your data has been saved using Stata version 13, the R package readstata13 will save you time and effort. To read about the package: see its manual on CRAN.

If using readstata13 is an option, you will need to combine the commands get.label and/or get.label.name and pass their output to get.origin.codes, which does exactly what you are looking for.
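
To make the idea of recovering the original codes concrete, here is a minimal base R sketch of the label-to-code lookup. It does not call readstata13 and is not that package's implementation; the lookup table lib_edu is typed in by hand from the label define block in the question, and the data frame and the column edu_code are hypothetical.

```r
# Hypothetical lookup table: label -> Stata value, copied from the
# `label define lib_edu` block in the question.
lib_edu <- c(
  "Less than high-school degree" = 10,
  "12th grade or higher, no college degree" = 20,
  "Undergraduate level (2 to 4 years of college)" = 30,
  "Graduate level (5 years of college or more)" = 40
)

# Hypothetical imported data frame in which edu arrived as a factor.
data <- data.frame(
  id  = 1:5,
  edu = factor(names(lib_edu)[c(2, 1, 4, 3, 1)], levels = names(lib_edu))
)

# Look up the Stata value for each row's label.
data$edu_code <- unname(lib_edu[as.character(data$edu)])

# Equivalent of Stata's `drop if edu==10`.
data <- data[data$edu_code != 10, ]
```

Keeping the numeric codes in a separate column leaves the labelled factor available for display while the codes are used for filtering.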

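Finally, if using readstata13 is not an option, you can try applying as.numeric(levels(f))[f] to the factor f after importing it into R. This converts a factor back to numbers through its levels, so it only recovers the Stata values when the levels are themselves numeric strings rather than text labels. For the reasons and more details, see this Stack Overflow question.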
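
The following small sketch shows what as.numeric(levels(f))[f] does, using made-up factors f and g rather than imported data. It is meant only to illustrate the difference between a factor's internal codes and the numbers stored in its levels.

```r
# Hypothetical factor whose levels are numeric strings.
f <- factor(c("10", "20", "40", "10"))

# Internal integer codes: positions in the sorted levels, starting at 1.
codes <- as.integer(f)

# Numbers recovered through the levels.
values <- as.numeric(levels(f))[f]

# With text labels there is nothing numeric to recover: the result is NA
# (R also emits a coercion warning, suppressed here).
g <- factor(c("low", "high"))
no_values <- suppressWarnings(as.numeric(levels(g))[g])
```

Here codes is 1, 2, 3, 1 while values is 10, 20, 40, 10, and no_values contains only NA.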

I would recommend trying to accomplish this through R if possible, as it will give a more reproducible workflow. But if you end up doing this through Stata, I would include a short comment in your R file explaining what you did in Stata before importing the data.