Generating rolling z-scores of panel data in Stata

Question

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.

Currently, the code I am using to accomplish this is the following (edited now for clarity)

xtset weocountrycode year
sort weocountrycode year

local win_len = 5 // Defining rolling window length.

quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x 
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x  
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] !=  weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values. 
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score

UPDATE:

My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.

I tried adding an if clause to the -rolling- statement:

quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x

But it did not fix the problem and the output is "weird" in the sense that

1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9. 2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.

Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.

Does this help?

Thanks!

In fact, too much detail here for my taste. Can you isolate in a single sentence what rolling doesn't do that you want or does do that you don't want? — Nick Cox
Updated above. In a single sentence: rolling tries to do an average when it can't really do what I asked it to; I would rather it just return a missing value when the data is not available, but instead it just tries to work what it has available. — nfernand
If you save the results of r(N) after summarize you can replace other results based on incomplete windows by missing. — Nick Cox
So, no way to have rolling simply not calculate these values? It's wasting a lot of time and printing a whole lot of e's. — nfernand
You could prefix rolling with quietly or use the nodots option to suppress display. Can you quantify extra computation time compared with your time spent trying to fix what you don't like? — Nick Cox

Roberto Ferrer Roberto Ferrer · Accepted Answer · 2014-02-25T00:52:07

I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by @NickCox.

You say:

... when rolling is set up to use a window length 5 rolling mean... if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ...

This is expected. help rolling states:

The window size refers to calendar periods, not the number of observations. If there are missing data (for example, because of weekends), the actual number of observations used by command may be less than window(#).

It's not actually doing a "length 1 rolling average", but I get to that later. Below some examples to see what rolling does:

clear all
set more off

*-------------------------- example data -----------------------------

set obs 92

gen dat = _n - 1
format dat %tq

egen seq = fill(1 1 1 1 2 2 2 2)

tsset dat

tempfile main
save "`main'"

list in 1/12, separator(4)


*------------------- Example 1. None missing ------------------------

rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)


*------- Example 2. All but one value, missing in first window ------

use "`main'", clear
replace seq = . in 1/3
list in 1/8

rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)


*------------- Example 3. All missing in first window --------------

use "`main'", clear
replace seq = . in 1/4
list in 1/8

rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)

Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set windowsize(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.

Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).

Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:

rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)

This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.

Generating rolling z-scores of panel data in Stata

1 Answers