0
votes

I want to create a new column in my cross-section survey dataset, that includes the education of a woman's husband. I have IDs of household (hid) and individual (HL1), and the following information:

  • MA1== whether woman is married (data only observable for women)
  • MA2== age of husband (data only observable for women who are married)
  • HL4== sex (data observable for all individuals)
  • HL6== age (data observable for all individuals)
  • ED4A== highest level of education (data observable for all individuals)

In essence, I want to create code to do the following:

  • First look at whether wife is currently married (MA1)
  • If yes, then look at age of husband (MA2)
  • Then pair age of husband (MA2) with age of male in household (HL6)
  • Then see what that male's education is (ED4A) and put that education in a new column, but in the same row as the woman's line number.

I tried this but it doesn't work: bysort hid (HL6) : gen husb_educ = ED4A[MA2]

Below is a sample from the dataset:

+-----+----------+-----+-----+--------+-----+----------+
| HL1 |   MA1    | MA2 | hid |  HL4   | HL6 |   ED4A   |
+-----+----------+-----+-----+--------+-----+----------+
|   1 |          |     | 106 | Male   |  57 | Diploma  |
|   2 |          |     | 106 | Female |  53 | Intermed |
|   3 |          |     | 106 | Male   |  30 | Higher S |
|   4 | No, not  |     | 106 | Female |  24 | Bachelor |
|   5 |          |     | 106 | Male   |  22 | Diploma  |
|   6 |          |     | 106 | Male   |  17 | Secondar |
|   7 |          |     | 106 | Female |  10 | Primary  |
|   8 | Yes, cur |  22 | 106 | Female |  23 | Diploma  |
|   9 |          |     | 106 | Female |   0 |          |
+-----+----------+-----+-----+--------+-----+----------+

So in this example, I want a new column which says Husband's education, and in row 8, puts Diploma as the value in the new column (since the woman's husband is 22years old, and the male 22 year old in the household has a diploma).

Same sample, without value labels:

+-----+-----+-----+-----+-----+-----+------+
| HL1 | MA1 | MA2 | hid | HL4 | HL6 | ED4A |
+-----+-----+-----+-----+-----+-----+------+
|   1 |     |     | 106 |   1 |  57 |    4 |
|   2 |     |     | 106 |   2 |  53 |    2 |
|   3 |     |     | 106 |   1 |  30 |    6 |
|   4 |   3 |     | 106 |   2 |  24 |    5 |
|   5 |     |     | 106 |   1 |  22 |    4 |
|   6 |     |     | 106 |   1 |  17 |    3 |
|   7 |     |     | 106 |   2 |  10 |    1 |
|   8 |   1 |  22 | 106 |   2 |  23 |    4 |
|   9 |     |     | 106 |   2 |   0 |      |
+-----+-----+-----+-----+-----+-----+------+

One especially large household:

    input
HL1 MA1 MA2 hid     HL4 HL6 ED4A
1   .   .   365809  1   33  1
2   1   33  365809  2   26  1
1   .   .   365810  1   58  1
2   .   .   365810  2   54  .
3   .   .   365810  1   23  3
4   .   .   365810  1   23  2
5   .   .   365810  1   18  3
6   .   .   365810  1   15  2
7   .   .   365810  2   12  2
8   .   .   365810  1   33  3
9   1   dk  365810  2   31  1
10  .   .   365810  2   13  2
11  .   .   365810  2   11  1
12  .   .   365810  1   9   1
13  .   .   365810  1   6   1
14  .   .   365810  2   3   .
15  .   .   365810  1   2   .
16  .   .   365810  1   33  3
17  1   33  365810  2   30  1
18  .   .   365810  1   8   1
19  .   .   365810  2   6   1
20  .   .   365810  2   5   .
21  .   .   365810  1   1   .
22  .   .   365810  1   32  4
23  1   32  365810  2   30  1
24  .   .   365810  1   5   .
25  .   .   365810  2   3   .
26  .   .   365810  1   2   .
27  .   .   365810  1   30  4
28  1   30  365810  2   28  1
29  .   .   365810  2   2   .
30  .   .   365810  1   0   .
31  .   .   365810  1   27  2
32  1   27  365810  2   27  1
33  .   .   365810  2   2   .
34  .   .   365810  2   0   .
         end 
2
Suppose the husband is aged 42. Then you are asking for ED4A[42] within the current household. That will usually be missing and it's not what you want. - Nick Cox
Give us a small sample of realistic data. Please don't ask us to imagine your data or make up our own example. - Nick Cox
Hi, thank you for your response. In this dataset, ED4A refers to the highest level of education that the individual has achieved during their lifetime (most of the adults would have completed their education anyways). So for the 42 year old, ED4A may be secondary education, or diploma, or missing, for example. Apologies if I was unclear. - Vidya
Evidently it was me who wasn't clear enough. I was asking for you to give us realistic data within your question. The definitions aren't the issue. - Nick Cox
I have done the requested. - Vidya

2 Answers

0
votes

As you have already outlined the steps necessary to do what you want, writing a simple script for this should not be problematic. In my experience its easier to learn the syntax if you write/do each step separately (and see what happens after each step, if any error is introduced etc). After getting the hang of it, you can slim the code down to one line. Something like this should work (tried following the steps in your question):

*look at wife currently married
*not necessary, as only married women have MA2, but next step takes only married women into account

* generate husbands age variable and spread to whole household (new var to keep original MA2 untouched)
gen husband_age=MA2 if MA1==married & HL4==woman
bys hid: egen husband_age_hid=max(husband_age)

*mark which individual is the husband (assumed this is what was meant by pairing age of husband with age of male in household)
gen husband=0
bys hid: replace husband = 1 if husband_age_hid == HL6

*copy husbands education information to the whole household
gen husband_ED4 = ED4 if husband==1
bys hid: egen husb_educ=max(husband_ED4)

*data cleaning, if necessary
drop husband*

could be better to use tempvars instead of generating new variables in the first step, but thought these variables could be useful later.

0
votes

This is a start. The code does loop over different married women within each household, but it does nothing if two or more men match the husband's age.

input  HL1  MA1  MA2  hid  HL4  HL6  ED4A 
  1    .   .     106    1   57     4 
  2    .   .     106    2   53     2 
  3    .   .     106    1   30     6 
  4    3   .     106    2   24     5 
  5    .   .     106    1   22     4 
  6    .   .     106    1   17     3 
  7    .   .     106    2   10     1 
  8    1  22     106    2   23     4 
  9    .   .     106    2    0     .    
 end 

bysort hid (MA1) : gen wid = _n if MA1 == 1 

su wid, meanonly 

local max = r(max) 

gen heducation = . 

quietly forval i = 1/`max' { 
    bysort hid : egen hage = min(cond(wid == `i', MA2, .)) 
    by hid : egen nmatches = total(HL4 == 1 & HL6 == hage) 
    by hid : egen work = min(cond(nmatches == 1 & HL6 == hage, ED4, .)) 
    replace heducation = work if wid == `i' 
    drop hage nmatches work 
}

sort hid HL1 

list 

     +-----------------------------------------------------------+
     | HL1   MA1   MA2   hid   HL4   HL6   ED4A   wid   heduca~n |
     |-----------------------------------------------------------|
  1. |   1     .     .   106     1    57      4     .          . |
  2. |   2     .     .   106     2    53      2     .          . |
  3. |   3     .     .   106     1    30      6     .          . |
  4. |   4     3     .   106     2    24      5     .          . |
  5. |   5     .     .   106     1    22      4     .          . |
     |-----------------------------------------------------------|
  6. |   6     .     .   106     1    17      3     .          . |
  7. |   7     .     .   106     2    10      1     .          . |
  8. |   8     1    22   106     2    23      4     1          4 |
  9. |   9     .     .   106     2     0      .     .          . |
     +-----------------------------------------------------------+

(UPDATE)

The extended example found a bug: one calculation was not restrictive enough by not excluding females with the same age. (Incidentally, note that the new data are for two households, not one.)

bysort hid (MA1) : gen wid = _n if MA1 == 1 

su wid, meanonly 

local max = r(max) 

gen heducation = . 

quietly forval i = 1/`max' { 
    bysort hid : egen hage`i' = min(cond(wid == `i', MA2, .)) 
    by hid : egen nmatches`i' = total(HL4 == 1 & HL6 == hage`i') 
    by hid : egen work`i' = min(cond(nmatches`i' == 1 & HL6 == hage`i' & HL4 == 1, ED4, .)) 
    replace heducation = work`i' if wid == `i' 
}

sort hid wid HL1 

    list hid wid MA2 HL6 ED4 heducation HL4 if inlist(HL6, 27, 30, 32, 33) | MA2 < ., sepby(hid) 

     +--------------------------------------------------+
     |    hid   wid   MA2   HL6   ED4A   heduca~n   HL4 |
     |--------------------------------------------------|
  1. | 365809     1    33    26      1          1     2 |
  2. | 365809     .     .    33      1          .     1 |
     |--------------------------------------------------|
  3. | 365810     1    27    27      1          2     2 |
  4. | 365810     2    33    30      1          .     2 |
  5. | 365810     3    32    30      1          4     2 |
  6. | 365810     4    30    28      1          4     2 |
 14. | 365810     .     .    33      3          .     1 |
 21. | 365810     .     .    33      3          .     1 |
 26. | 365810     .     .    32      4          .     1 |
 30. | 365810     .     .    30      4          .     1 |
 33. | 365810     .     .    27      2          .     1 |
     +--------------------------------------------------+

For more general discussions, see

here or here or here.