
I am working through a question in Cameron and Trivedi's "Microeconometrics Using Stata". The question concerns a cross-sectional dataset with two key variables: log of annual earnings (lnearns) and annual hours worked (hours).

I am stuck on part 2 of the question, but I'll type the whole thing for context.

A moving average of y after data are sorted by x is a simple case of nonparametric regression of y on x.

  1. Sort the data by hours.
  2. Create a centered 15-period moving average of lnearns with ith observation yma_i = (1/25) * sum_{j=-12}^{12} y_{i+j}. This is easiest using the command forvalues.
  3. Plot this moving average against hours using the twoway connected graph command.

I'm unsure what command(s) to use for a moving average of cross-sectional data. Nor do I really understand what a moving average over one-period data shows.

Any help would be great and please say if more information is needed. Thanks!

Edit1:

You should be able to download the dataset from https://www.dropbox.com/s/5d8qg5i8xdozv3j/mus02psid92m.dta?dl=0. It is a small extract from the 1992 individual-level data of the Panel Study of Income Dynamics, as used in the textbook.

I'm still getting used to the syntax, but here is my attempt:

sort hours
gen yma=0 
1. forvalues i = 1/4290 {
2. quietly replace yma = yma + (1/25)(lnearns[`i'-12] to lnearns[`i'+12]) 
3. }
What's needed is the dataset or an indication of where a downloadable dataset can be found. Also, what code you tried. - Nick Cox
Hi Nick, I've edited my original post with a downloadable dataset and my attempt. - Kai_M
Just to flag for others what you know: your code line numbered 2 is a long way short of legal. As you've been given suggestions that work, I will not dissect it. - Nick Cox

2 Answers

1
votes

There are other ways to do this, but I create a variable for each lag and lead, sum these variables together with the original, and then divide by the number of nonmissing values in the window (25 for a complete window, as in the equation you provided):

sort hours

// generate variables for the 12 leads and lags
forvalues i = 1/12 {
    gen lnearns_plus`i'  = lnearns[_n+`i']
    gen lnearns_minus`i' = lnearns[_n-`i']
}

// get the sum of the lnearns variables
egen yma = rowtotal(lnearns_* lnearns)

// get the number of nonmissing lnearns variables
egen count = rownonmiss(lnearns_* lnearns)

// get the average
replace yma = yma/count

// clean up
drop lnearns_* count

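This gives you the variable you are looking for (the moving average). It divides by the number of nonmissing values in each window rather than simply by 25, because lnearns is missing for many observations and the window is also truncated for the first and last 12 observations.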
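The lag and lead variables are not strictly needed. As a minimal sketch of the same window average, assuming the dataset is already in memory, the loop below accumulates a running sum and a running count instead. The names ysum, ycnt and yma2 are made up for illustration.

```stata
// assumes mus02psid92m is in memory, with variables lnearns and hours
sort hours

// start both totals from the centre observation (j = 0)
gen double ysum = cond(missing(lnearns), 0, lnearns)
gen byte   ycnt = !missing(lnearns)

// add the j-th lead and lag; a subscript that runs off either end
// of the data returns missing, so those terms are skipped
forvalues j = 1/12 {
    quietly replace ysum = ysum + lnearns[_n+`j'] if !missing(lnearns[_n+`j'])
    quietly replace ycnt = ycnt + 1               if !missing(lnearns[_n+`j'])
    quietly replace ysum = ysum + lnearns[_n-`j'] if !missing(lnearns[_n-`j'])
    quietly replace ycnt = ycnt + 1               if !missing(lnearns[_n-`j'])
}

// average over the nonmissing values in the window
// (missing if the whole window is missing, since 0/0 is missing)
gen double yma2 = ysum/ycnt
drop ysum ycnt
```

This avoids creating 24 temporary variables, at the cost of a slightly longer loop body.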

As to your question of what this shows, my interpretation is that it shows the local average of lnearns around each value of hours. If you graph lnearns on the y axis and hours on the x axis, you get something that looks crazy because there is a lot of variation, but if you plot the moving average the trend is much clearer.
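For the plotting step, here is a rough sketch that assumes the moving average is stored in a variable named yma and the data are still in memory. The graph name ma_plot, the colors and the marker sizes are my own choices for illustration, and the /// line continuations only work in a do-file, not at the interactive prompt.

```stata
// raw data in light gray, moving average connected on top
// (assumes lnearns, hours and yma exist in the data in memory)
twoway (scatter lnearns hours, msize(tiny) mcolor(gs12))          ///
       (connected yma hours, sort msize(tiny)),                   ///
       ytitle("lnearns")                                          ///
       legend(order(1 "lnearns" 2 "moving average"))              ///
       name(ma_plot, replace)
```

The sort option on the connected plot makes sure the points are joined in order of hours, whatever the current sort order of the data.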

0
votes

In fact this dataset can be read into a suitable directory by

net from http://www.stata-press.com/data/musr
net install musr
net get musr
u mus02psid92m, clear

This smoothing method is problematic in that sort hours doesn't have a unique result: observations tied on hours are placed in arbitrary order, so the values of the response entering each window are not uniquely determined. But an implementation in a similar spirit is possible with rangestat (SSC).

sort hours
gen counter = _n
rangestat (mean) mean=lnearns (count) n=lnearns, interval(counter -12 12)
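To see how much the ties matter and to make the result at least reproducible, something along these lines may help. This is only a sketch: it assumes the dataset is in memory, the names obsno, ma_stable and n_stable are made up, and a stable sort only keeps tied observations in their existing order, which is still an arbitrary choice.

```stata
// install rangestat from SSC only if it is not already available
capture which rangestat
if _rc ssc install rangestat

// how many observations share the same value of hours?
duplicates report hours

// stable sort: ties keep their current relative order,
// so rerunning this on the same data gives the same windows
sort hours, stable
gen long obsno = _n
rangestat (mean) ma_stable=lnearns (count) n_stable=lnearns, interval(obsno -12 12)
```
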

There are many other ways to smooth. One is

gen binhours = round(hours, 50)
egen binmean = mean(lnearns), by(binhours)
scatter lnearns hours, ms(Oh) mc(gs8) || scatter binmean binhours , ms(+) mc(red)

Even better would be to use lpoly.
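A minimal sketch of what that could look like, assuming the dataset is in memory. The choice of degree(1) (local linear) and the variable name lp_hat are assumptions for illustration; the kernel and bandwidth are left at lpoly's defaults.

```stata
// local linear smooth of lnearns on hours, with a confidence band
lpoly lnearns hours, degree(1) ci msize(tiny)

// same smooth, evaluated at each observed value of hours and saved
// as a variable instead of being graphed
lpoly lnearns hours, degree(1) at(hours) generate(lp_hat) nograph
```

Unlike the moving average, the smooth depends only on the values of hours and not on the order of tied observations.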