Is there a reason why there are two different commands to generate a new variable?
Is there a simple way to remember when to use gen and when to use egen?
They both create a new variable, but they work with different sets of functions. You will typically use gen for simple transformations of other variables in your dataset, such as
gen newvar = oldvar1^2 * oldvar2
In my workflow, egen usually appears when I need functions that work across all observations, as in
egen max_var = max(var)
or for more complex instructions such as
egen newvar = rowmax(oldvar1 oldvar2)
which calculates, for each observation, the larger of oldvar1 and oldvar2. I don't think there is a clear logic for separating the two commands.
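One practical way to see the difference is to run both commands on a tiny dataset. The following is only a sketch: the variables x1 and x2 and their values are made up for illustration, and it is meant to be run as a do-file. It shows that some function names exist in both commands but behave differently (max and sum/total in particular).

```stata
* Toy data: three observations, two made-up variables
clear
input x1 x2
1 5
4 2
3 3
end

* gen evaluates its expression one observation at a time
* prod is 5, 32, 27
gen prod = x1^2 * x2
* gen's max() compares its arguments within each row: 5, 4, 3
gen rowbig = max(x1, x2)
* gen's sum() is a running sum down the dataset: 1, 5, 8
gen runsum = sum(x1)

* egen functions look across observations or across a variable list
* column maximum, the same value (4) in every row
egen colmax = max(x1)
* column total, the same value (8) in every row
egen tot = total(x1)
* row-wise maximum over a variable list: 5, 4, 3
egen rowbig2 = rowmax(x1 x2)

list
```

So a rough rule of thumb, not an official one: if the result for an observation depends only on that observation's own values, gen is usually enough; if it needs a summary over many observations or over a list of variables, look for an egen function.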
gen
generate may be abbreviated to gen or even g, and can be used with the following mathematical operators and functions:
+    addition
-    subtraction
*    multiplication
/    division
^    power
A large number of functions is available. Here are some examples:
abs(x)                absolute value of x
exp(x)                exponential (antilog) of x
int(x) or trunc(x)    truncation to integer value
ln(x), log(x)         natural logarithm of x
round(x)              x rounded to the nearest integer
round(x,y)            x rounded in units of y (e.g., round(x,.1) rounds to one decimal place)
sqrt(x)               square root of x
runiform()            returns uniformly distributed random numbers between 0 and nearly 1
rnormal()             returns random numbers that follow a standard normal distribution
rnormal(x,y)          returns random numbers that follow a normal distribution with a mean of x and a s.d. of y
egen
A number of more complex possibilities have been implemented in the egen command, as in the following examples:
egen nkids = anycount(pers1 pers2 pers3 pers4 pers5), value(1)
egen v323r = rank(v323)
egen myindex = rowmean(var15 var17 var18 var20 var23)
egen nmiss = rowmiss(x1-x10 var15-var23)
egen xtotal = rowtotal(x1-x10 var15-var23)
egen incomst = std(income)
bysort v3: egen mincome = mean(income)
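To see what a few of these functions return, they can be tried on a small made-up dataset. This is a minimal sketch: the variables group, income, x1 and x2 and their values are assumptions for illustration only. It mainly shows how the row functions treat missing values and how the by prefix changes the result of mean().

```stata
* Toy data with some missing values (all values are made up)
clear
input group income x1 x2
1 10 1 .
1 20 . .
2 30 2 4
2 50 3 5
end

* number of missing values per row across x1 and x2 (1, 2, 0 or 0 here)
egen nmiss = rowmiss(x1 x2)
* row sum; missing values count as 0, so an all-missing row gives 0
egen xsum = rowtotal(x1 x2)
* row mean over non-missing values; an all-missing row gives missing
egen xmean = rowmean(x1 x2)
* standardized income (mean 0, standard deviation 1)
egen incomst = std(income)
* group means: 15 for group 1, 40 for group 2
bysort group: egen mincome = mean(income)

list
```

Note that sorting may reorder observations within a group, so the rows in the listing are not guaranteed to appear in the order they were entered.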
Detailed usage explanations can be found at this link.