Say that I have these data:

clear all
input n str6 G1 str6 G2 v desired computed
1 "B" "A" 1 5 .
2 "A" "A" 2 5.5 .
3 "C" "A" 3 4.5 .
4 "A" "B" 4 2 .
5 "B" "B" 5 2.5 .
6 "C" "B" 6 1.5 .
end

n is observation number, G1 is group 1, G2 is group 2 (say class 1 and class 2), and v is value. desired is the desired output, and computed will be the attempt at the desired output.

My goal is to perform an operation in Stata, in this example an average, over all observations that had no contact with a given observation. "No contact" means not in the same G1 and not in the same G2 as that observation, which also excludes the observation itself. For example, desired for observation 1 is the average of v over observations 4 and 6. (Observations 1, 2, and 3 are excluded because they share the same G2 as observation 1; observation 5 is excluded because it shares the same G1.) So we sum v for observations 4 and 6 to get 4+6=10 and divide by the count, 2, to get 5.

I think I can get what I want with the following code:

local N = _N
forvalues i = 1/`N' {
    preserve
    *create temp, which, when equal to 1, indicates the observations to make the calculation on
    gen temp = 1
    *save locals equal to the first and second group of `i'
    local temp_G1 = G1[`i']
    local temp_G2 = G2[`i'] 
    *make temp = 0 for observations that were in first and/or second group as `i'
    replace temp = 0 if G1=="`temp_G1'"
    replace temp = 0 if G2=="`temp_G2'"
    *compute sum on observations that have a temp equal to 1
    egen sum = total(v) if temp==1
    *fill in the sum for all obs
    egen sum_all = max(sum)
    *compute number in group
    egen num = total(temp) if temp==1
    egen num_all = max(num)
    *display the count (num is a variable, not a local macro)
    display num_all[1]
    *save the value of the sum in a local
    local calc = sum_all[`i']/num_all[`i']
    restore
    *fill in the value from the local for row `i'
    replace computed = `calc' in `i'
}

However, this approach seems very long and inelegant. Is there a better way to go about this in Stata? I thought about using bys, but I couldn't figure it out. If it were only G1 or only G2, I think it would be easier, but both together seem problematic because of double counting: bys might include the same observation in both the G1 count and the G2 count.

I guess another way to ask the question is whether there is a way to apply a function to each observation/row, like R's apply family, or whether I need to use the clumsy loop approach I use here.

You can do this without loops but the code will be longer and not more efficient. I think @NickCox's approach is the best for datasets of moderate size. - user8682794
The other thing to consider is that in other languages some functions simply hide the loops from the user but they are not really vectorized. - user8682794

1 Answer

In this case, using preserve and restore would make it slow if you have a large data set. You have also generated several intermediate variables which may not be necessary. If I understand your question correctly, your code could be substantially simplified. I am using Stata 14:

local N = _N
forvalues i = 1/`N' {
    tempvar tempv
    gen `tempv' = (G1 != G1[`i'] & G2 != G2[`i'])
    sum v if `tempv' == 1
    replace computed = r(sum)/r(N) in `i'
}

EDIT NJC:

This in turn can be simplified (and sped up):

forvalues i = 1/`=_N' { 
    summarize  v if G1 != G1[`i'] & G2 != G2[`i'], meanonly 
    replace computed = r(mean) in `i' 
}

Note that the first loop above creates one new temporary variable for each observation, but you don't need any of them. Further, the meanonly option of summarize leaves the mean in r(mean); it leaves other results too, but there is no need to calculate r(sum)/r(N) when summarize has already done it.
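
To see what that means for a single observation, here is a minimal sketch using the example data from the question (re-entered so the block runs on its own). It lists the observations that qualify for observation 1, then compares the manual r(sum)/r(N) with r(mean). As far as I know, meanonly still leaves r(N) and r(sum) behind, which is what the comparison relies on.

```stata
clear all
input n str6 G1 str6 G2 v
1 "B" "A" 1
2 "A" "A" 2
3 "C" "A" 3
4 "A" "B" 4
5 "B" "B" 5
6 "C" "B" 6
end

* observations sharing neither G1 nor G2 with observation 1
list n G1 G2 v if G1 != G1[1] & G2 != G2[1]

* meanonly skips the variance calculation and the printed table
summarize v if G1 != G1[1] & G2 != G2[1], meanonly

* show every result that summarize left behind
return list

* the manual mean and the stored mean should agree
display r(sum)/r(N)
display r(mean)
```

For this data the listing should show observations 4 and 6, and both display lines should give 5, matching desired for observation 1.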

Feel free to merge this edit with the main question. I see no point in posting a different answer, unless and until I can see a way to avoid a loop over observations.
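
On avoiding the loop: one possibility, offered as a sketch rather than a recommendation, is inclusion-exclusion. The total over observations sharing neither group equals the overall total, minus the total for the same G1, minus the total for the same G2, plus the total for the same G1 and G2 (which was subtracted twice). The same holds for the counts. The variable names below (tot_all, n_g12, noloop, and so on) are made up for illustration, and the sketch assumes G1 and G2 are never missing.

```stata
clear all
input n str6 G1 str6 G2 v desired
1 "B" "A" 1 5
2 "A" "A" 2 5.5
3 "C" "A" 3 4.5
4 "A" "B" 4 2
5 "B" "B" 5 2.5
6 "C" "B" 6 1.5
end

* totals of v: overall, by G1, by G2, and by the G1-G2 cell
egen double tot_all = total(v)
egen double tot_g1  = total(v), by(G1)
egen double tot_g2  = total(v), by(G2)
egen double tot_g12 = total(v), by(G1 G2)

* matching counts of nonmissing v
egen n_all = count(v)
egen n_g1  = count(v), by(G1)
egen n_g2  = count(v), by(G2)
egen n_g12 = count(v), by(G1 G2)

* inclusion-exclusion: add back the cell that was subtracted twice
gen double noloop = (tot_all - tot_g1 - tot_g2 + tot_g12) / ///
                    (n_all - n_g1 - n_g2 + n_g12)

list n G1 G2 v desired noloop
```

For observation 1 this gives (21 - 6 - 6 + 1)/(6 - 2 - 3 + 1) = 10/2 = 5. When no observation qualifies, the denominator is zero and noloop is missing. As the comment on the question notes, this is longer than the loop, so it probably only pays off when the number of observations makes the loop too slow.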