Coding dichotomous variables in Stata

Question

I have a set of dichotomous variables for firm size: emp1_2 (i.e. a firm with 1 or 2 employed people, including the owner), emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500. In addition, I have no information on the size of 27 firms, but I have an educated guess that they are large firms.

I want to create a dichotomous variable for a firm being a "small firm"; that is, a variable that equals 1 when emp1_2==1 | emp3_9==1 | emp10_19==1, and 0 otherwise.

To my understanding of Stata, of which I am only a basic user, the following two methods of constructing dichotomous variables should be equivalent.

Method 1)

gen lar_firm = 0
replace lar_firm = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1

Method 2)

gen lar_firm = (emp1_2 | emp3_9 | emp10_19)

Instead, I have found that with method 2) lar_firm equals 1 both for firms for which emp1_2 | emp3_9 | emp10_19 is true and for the firms that do not fall into any of the categories (i.e. emp1_2, emp3_9, emp10_19, emp20_49, emp50_99, emp100_249, emp250_499, emp500) but for which I have an educated guess that they are large firms.

I am wondering whether there is some subtle difference between the two methods. I thought they should lead to the same outcome.

Please post a data example and the output so we can test it. With a small dataset I made, I don't see any difference. You do not need the ( ... ) in method 2, though. — timat
I cannot post these data, not even a part of them, unfortunately; I am not allowed to. In principle, if there were a subtle difference in how the two methods work, more experienced Stata users might know about it without needing to work on the data. — Fuca26
For instance, I believe that method 1 assigns values to all units of observation, while method 2 excludes from the assignment of values (either 1 or 0) those units of observation with missing values for the conditions I have established. I cannot understand why it instead seems to have assigned the value 1 to such units of observation. — Fuca26
You can always post fake data. I made fake data for what you explain and I can post it, so can you. — timat

timat · Accepted Answer · 2016-10-21T15:37:34

When you do

gen lar_firm = emp1_2 | emp3_9 | emp10_19

you're testing if

(emp1_2 != 0) | (emp3_9 != 0) |(emp10_19 != 0)

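In particular, missing values (.) are different from 0; in fact, Stata treats them as larger than any number. Each of these tests is therefore true for an observation with a missing value, so lar_firm is set to 1 for it. Method 1 compares with == 1, which is false for a missing value, so those observations keep the value 0.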
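Since the real data cannot be posted, here is a minimal sketch on fake data that should reproduce the difference. The three observations and the variable names m1 to m4 are made up for illustration; only emp1_2, emp3_9 and emp10_19 come from the question. The third firm stands in for one of the firms of unknown size, assuming all of its size dummies are missing.

```stata
* Fake data: three size dummies; the third firm has unknown size (all missing)
clear
input emp1_2 emp3_9 emp10_19
1 0 0
0 0 0
. . .
end

* Method 1: start at 0, set to 1 only where a dummy equals 1
gen byte m1 = 0
replace m1 = 1 if emp1_2==1 | emp3_9==1 | emp10_19==1

* Method 2: "true" means nonzero, and missing (.) counts as nonzero
gen byte m2 = emp1_2 | emp3_9 | emp10_19

* Explicit comparisons: (. == 1) is false, so this matches method 1
gen byte m3 = emp1_2==1 | emp3_9==1 | emp10_19==1

* Optional: leave firms with any missing dummy as missing instead of 0
gen byte m4 = emp1_2==1 | emp3_9==1 | emp10_19==1 if !missing(emp1_2, emp3_9, emp10_19)

list
```

For the third firm, m1 and m3 are 0, m2 is 1, and m4 is missing. Whether the firms of unknown size should be coded 0 (as in m1 and m3) or left missing (as in m4) depends on how much you trust the guess that they are large.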

For more information:

http://www.stata.com/support/faqs/data-management/logical-expressions-and-missing-values/
