Stata: Uniquely sorting points within groups

Question

I'm conducting a household survey with a random sample of 200 villages. Using QGIS, I picked a random point 5-10km from my original villages. I then obtained, from the national statistical office, the village codes for those 200 "neighbor" villages - as well as a buffer of 10 additional neighbor villages. So my total sample is:

200 original villages + 210 neighbor villages = 410 villages, total

We're going to begin fieldwork soon, and I want to give each survey team a map for 1 original village + the nearest neighbor village. Because I'm surveying in some dense urban areas as well, sometimes a neighbor village is actually quite close to more than one original village.

My problem is this: if I run Distance Matrix in QGIS, matching an old village to its nearest neighbor village, I get duplicates in the latter. To get around this, I've matched each old village to the nearest 5 neighbor villages. My main idea/goal is to pick the nearest neighbor that hasn't already been picked.

I end up with a .csv like so:

[Image not preserved: sample rows of the .csv]

As you can see, picking the five nearest villages, I'm getting repeats - neighbor village 79 is showing up as nearby to original villages 1, 2, 3, and 4. This is fine, as long as I can assign neighbor village 79 to one (and only one) original village, and then have the rest uniquely match as well.

What I want to do, then, is to uniquely match each original village to one neighbor village. I've tried a bunch of stuff, none of which has worked. My sense is that I need to loop over the original village groups, flag one of the neighbor villages (e.g. taken==1), and then, somehow, have that taken==1 apply to every instance of the same neighbor village (say, neighbor village 79).

Here's some sample code of what I was thinking. Note: this uniquely matches 163 of my neighbors.

        gen taken = 0
        sort ea distance

        by ea: replace taken=1 if _n==1

        keep if taken==1

        codebook FID ea

This also doesn't work; it just sets taken to 1 for all obs:

        foreach i in 5 4 3 2 1 {
            by ea: replace taken=1 if _n==`i' & taken==0
        }

What I need to do, I think, is loop over both _N and _n, and maybe use an if/else. But I'm not sure how to put it all together.

(Tangentially, is there a better way to loop over decreasing values in Stata? Similar to i-- in other programming languages?)

It's better to post data formatted as code, and not as an image. It facilitates copy/paste operations. Within Stata, you can use list, clean noobs and copy that here. — Roberto Ferrer

Roberto Ferrer · Accepted Answer · 2014-11-14T13:09:21

This should work, but the setup is a little different from what you say you need. By comparing with only five neighbors, you have an ill-posed problem. Imagine that geography is such that six (or more) original villages all share the same list of five neighbors. What do you assign to the sixth original village?

Given this, I compare each original village with all candidate neighbor villages, not only five. The strategy is then to assign original village 1 its closest neighbor, original village 2 its closest neighbor after discarding the one previously assigned, and so on. This assumes an equal number of original and neighbor villages, but you have ten additional ones, so you need to give that some thought.

clear
set more off

*----- example data -----

local numvilla = 4 // change to test
local numobs = `numvilla'^2

set obs `numobs'

egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n

set seed 1956
gen dist = runiform()*10

*----- what you want ? -----

sort origv dist

list, sepby(origv)

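quietly forvalues villa = 1/`numvilla' {

    * keep only the closest remaining neighbor of this village (row number villa)
    drop if origv == `villa' & _n > `villa'
    * discard that neighbor from every village not yet processed
    drop if neigh == neigh[`villa'] & _n > `villa'

}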

list
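
As far as I can tell, the loop above also runs unchanged when there are more candidate neighbors than original villages, which is the situation with the ten extra villages. A minimal sketch on made-up data, assuming 3 original villages with 5 candidates each (the locals numorig and numneigh are illustrative names, not part of the question's data):

```stata
clear
set more off

* hypothetical sizes: fewer original villages than candidate neighbors
local numorig  = 3
local numneigh = 5

set obs `=`numorig'*`numneigh''
egen origv = seq(), from(1) to(`numorig') block(`numneigh')
bysort origv: gen neigh = _n

set seed 1956
gen dist = runiform()*10

sort origv dist

quietly forvalues villa = 1/`numorig' {
    * keep only the closest remaining neighbor for this village
    drop if origv == `villa' & _n > `villa'
    * remove that neighbor from the villages not yet processed
    drop if neigh == neigh[`villa'] & _n > `villa'
}

* one row per original village, and no neighbor used twice
isid origv
isid neigh
list
```

The two isid calls stop with an error if either identifier repeats, so they double as a check that the match is one-to-one. The unused neighbors are simply the ones that no longer appear in the data.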

The other issue is that results depend on which original village is processed first, second, and so on, because each assignment removes a neighbor from the options left to the villages that come later. You may want to randomize the order of the original villages before you start the assignments.
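
One possible way to randomize the processing order, as a sketch only: draw one random number per original village, rank the villages by that draw, and run the same loop on the rank instead of on origv. The variables u and order are hypothetical helper names, and the seed is arbitrary.

```stata
clear
set more off

* example data: 4 original villages x 4 candidate neighbors
local numvilla = 4
set obs `=`numvilla'^2'
egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n
set seed 1956
gen dist = runiform()*10

* one random draw per original village, copied to all of its rows
bysort origv: gen double u = runiform() if _n == 1
by origv: replace u = u[1]

* order = position of the village in the random processing sequence
egen order = group(u)
sort order dist

quietly forvalues k = 1/`numvilla' {
    * keep the closest remaining neighbor of the k-th village in the sequence
    drop if order == `k' & _n > `k'
    * make that neighbor unavailable to later villages
    drop if neigh == neigh[`k'] & _n > `k'
}

isid origv
isid neigh
list origv neigh dist order
```

Changing the seed changes the sequence, so rerunning with a few different seeds gives a rough sense of how sensitive the matches are to the processing order.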

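You can gain some efficiency by replacing the condition & _n > `villa' with the range in `=`villa'+1'/L, but you won't notice much with your sample size. Note that an in range must lie within the data: Stata stops with an "observation numbers out of range" error when `villa'+1 exceeds _N, which happens in the final iteration, so that case needs a guard.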
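
A minimal sketch of the in-range variant on the same example data. Because Stata rejects an in range that starts past the last observation, each drop is wrapped in a check on _N; the guard is my addition, not part of the original suggestion.

```stata
clear
set more off

local numvilla = 4
set obs `=`numvilla'^2'
egen origv = seq(), from(1) to(`numvilla') block(`numvilla')
bysort origv: gen neigh = _n
set seed 1956
gen dist = runiform()*10
sort origv dist

quietly forvalues villa = 1/`numvilla' {
    * rows 1..villa are settled; only touch the rows after them, if any exist
    if `villa' < _N {
        drop if origv == `villa' in `=`villa'+1'/L
    }
    * _N may have changed after the first drop, so check again
    if `villa' < _N {
        drop if neigh == neigh[`villa'] in `=`villa'+1'/L
    }
}

list
```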

I'm not qualified to say anything about your sample design, so take this answer to address only the programming issue you pose.

By the way, to loop over decreasing values:

forvalues obs = 5(-1)1 {
    display "`obs'"
}

See help numlist.
