0
votes

I have a "raw" data set that I'm trying to clean. It consists of individuals observed yearly from 2000 to 2010, with the variable age recorded in each year. There are around 20000 individuals in the data set, and they share the same problem.

The variable age stops increasing in the years 2004-2006. For example, for one individual it looks like this:

2000: 16, 
2001: 17,
2002: 18,
2003: 19,
2004: 19,
2005: 19,
2006: 19,
2007: 23,
2008: 24,
2009: 25,
2010: 26,

So far I have tried to generate variables for the max age and max year:

bysort id: egen last_year=max(year)
bysort id: egen last_age=max(age) 

and then use foreach combined with lags to replace the age variable in decreasing order, counting back from the new variable last_age (which is now 26 in all years for this individual), so that the result looks like this:

2010: 26
2009: 25 (26-1)
2008: 24 (26-2), and so on.

However, I am having trouble finding the correct code for this.

2 Answers

1
votes

Assuming that, for each individual, the first value of age is not missing and is correct, something like this might work:

bysort id (year): replace age = age[1]+(year-year[1])

Alternatively, if the last value of age is assumed to always be accurate:

bysort id (year): replace age = age[_N]-(year[_N]-year)

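Or, just fix the ages that fail to increase from one observation to the next. Because replace works through the observations in order, age[_n-1] is already the corrected value by the time the next observation is tested, so the condition has to be <= rather than ==; with == only the first repeated age in a run would be fixed:

bysort id (year): replace age = age[_n-1]+(year-year[_n-1]) if _n>1 & age<=age[_n-1]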

In the absence of sample data, none of these has been tested.
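
One way to check the first two variants is to type in the single individual shown in the question and compare the results. This is only a sketch: the id value 1 and the variable names age_fwd and age_bwd are made up for illustration, and it uses generate rather than replace so that the original age stays available for comparison.

```stata
* Example data: the one individual from the question (id = 1 is invented)
clear
input id year age
1 2000 16
1 2001 17
1 2002 18
1 2003 19
1 2004 19
1 2005 19
1 2006 19
1 2007 23
1 2008 24
1 2009 25
1 2010 26
end

* Count forward from the first observation of each individual
bysort id (year): gen age_fwd = age[1] + (year - year[1])

* Count backward from the last observation of each individual
bysort id (year): gen age_bwd = age[_N] - (year[_N] - year)

* Compare the recorded and reconstructed ages side by side
list id year age age_fwd age_bwd, sepby(id)

* For this individual the two reconstructions coincide
assert age_fwd == age_bwd
```

Both new variables run from 16 in 2000 to 26 in 2010 here, because the first and last recorded ages happen to be consistent with each other. On real data that need not hold for every id, in which case the assert will stop with an error and point to where the two assumptions disagree.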

0
votes

William's code is very much to the point, but a few extra remarks won't fit easily into a comment.

Suppose we have age already and generate two other estimates going forward and backward as he suggests:

bysort id (year): gen age2 = age[1] + (year - year[1])
bysort id (year): gen age3 = age[_N] - (year[_N] - year)

Now if all three agree, we are good, and if two out of three agree, we will probably go with the majority. Either way, that value is the median, and for three values the median is the sum minus the minimum minus the maximum.

gen median = (age + age2 + age3) - max(age, age2, age3) - min(age, age2, age3) 

If we get three different estimates, we should look more carefully.

edit age* if max(age, age2, age3) > median & median > min(age, age2, age3) 

A final test is whether medians increase in the same way as years:

bysort id (year) : assert (median - median[_n-1]) == (year - year[_n-1]) if _n > 1
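
For what it's worth, here is the same sequence run on the one individual shown in the question, as a sketch rather than tested production code. The id value is invented, and list and count stand in for edit so that the whole thing can run non-interactively in a do-file.

```stata
* Example data: the one individual from the question (id = 1 is invented)
clear
input id year age
1 2000 16
1 2001 17
1 2002 18
1 2003 19
1 2004 19
1 2005 19
1 2006 19
1 2007 23
1 2008 24
1 2009 25
1 2010 26
end

* Two alternative estimates: counting forward and counting backward
bysort id (year): gen age2 = age[1] + (year - year[1])
bysort id (year): gen age3 = age[_N] - (year[_N] - year)

* Median of three values = sum - maximum - minimum
gen median = (age + age2 + age3) - max(age, age2, age3) - min(age, age2, age3)

* Observations where the recorded age is outvoted (2004-2006 in this example)
list id year age age2 age3 median if age != median, sepby(id)

* Observations with three distinct estimates, which need a closer look
count if max(age, age2, age3) > median & median > min(age, age2, age3)

* Final check: the median must advance one year per calendar year
bysort id (year): assert (median - median[_n-1]) == (year - year[_n-1]) if _n > 1
```

In this example the count is zero and the assert passes, since age2 and age3 agree in every year and so always carry the vote against the stuck values of age.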