Stata: Extracting values and save them as scalars (and more)

Question

This question is a follow-up question from Stata: replace, if, forvalues. Consider this data:

set seed 123456
set obs 5000
g firmid = "firm" + string(_n)    /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */

The first 10 observations are:

     +--------------------------------------+
     | firmid   nw         lat          lon |
     |--------------------------------------|
  1. |  firm1   81   39.915526   -75.505018 |
  2. |  firm2   35   39.548523   -75.201567 |
  3. |  firm3   10   39.657866    -75.17988 |
  4. |  firm4   83   39.957938   -75.898837 |
  5. |  firm5   56   39.575881   -75.169157 |
  6. |  firm6   73   39.886184   -75.857255 |
  7. |  firm7   27    39.33288   -75.724665 |
  8. |  firm8   75   39.165549    -75.96502 |
  9. |  firm9   64   39.688819   -75.232764 |
 10. | firm10   76   39.012228   -75.166272 |
     +--------------------------------------+

I need to calculate the distances between firm 1 and all other firms. So, the vincenty command looks like:

. scalar theLat = 39.915526
. scalar theLon = -75.505018
. vincenty lat lon theLat theLon, hav(distance_km) inkm

The vincenty command creates the distance_km variable that has distances between each observation and firm 1. Here, I manually copy and paste the two numbers that are 39.915526 and -75.505018.

Question 1: What's the syntax that extracts those numbers?

Now, I can keep observations where distances_km <= 2. And,

. egen near_nw_sum = sum(nw)

will create the sum of workers within 2 kilometers of the firm 1. (Or, the collapse command may do the job.)

Question 2: I have to do this for all firms, and the final data should look like:

     +-----------------------------------------------------------------+
     | firmid   nw         lat          lon            near_nw_sum     |
     |-----------------------------------------------------------------|
  1. |  firm1   81   39.915526   -75.505018  (# workers near firm1)    |
  2. |  firm2   35   39.548523   -75.201567  (# workers near firm2)    |
  3. |  firm3   10   39.657866    -75.17988  (# workers near firm3)    |
  4. |  firm4   83   39.957938   -75.898837  (# workers near firm4)    |
  5. |  firm5   56   39.575881   -75.169157  (# workers near firm5)    |
  6. |  firm6   73   39.886184   -75.857255  (# workers near firm6)    |
  7. |  firm7   27    39.33288   -75.724665  (# workers near firm7)    |
  8. |  firm8   75   39.165549    -75.96502  (# workers near firm8)    |
  9. |  firm9   64   39.688819   -75.232764  (# workers near firm9)    |
 10. | firm10   76   39.012228   -75.166272  (# workers near firm10)   |
     +-----------------------------------------------------------------+

Creating the near_nw_sum variable is my final goal. I need your help here for my weak data management skill.

Roberto Ferrer Roberto Ferrer · Accepted Answer · 2013-11-05T08:02:35

The following is basically the same strategy found here and is based on your "final goal". Again, it can be useful depending on the size of your original dataset.joinby creates observations so you may exceed the Stata limit. However, I believe it does what you want.

clear all
set more off

set seed 123456
set obs 10
g firmid = _n   /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */
gen dum = 1
list

* joinby procedure
tempfile main
save "`main'"

rename (firmid lat lon nw) =0
joinby dum using "`main'"
drop dum

* Pretty print
sort firmid0 firmid
order firmid0 firmid
list, sepby(firmid0)

* Uncomment if you do not want to include workers in the "base" firm.
*drop if firmid0 == firmid

* Compute distance
vincenty lat0 lon0 lat lon, hav(distance_km) inkm
keep if distance_km <= 40 // an arbitrary distance
list, sepby(firmid0)

* Compute workers of nearby-firms
collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)
list

What it does is form pairwise combinations of firms to compute distances and sum workers of nearby-firms. No need here to extract scalars as asked in Question 1. Also, no need to complicate the variable firmid converting to string.

The following overcomes the problem of the Stata limit on number of observations.

clear all
set more off

* Create empty database
gen x = .
tempfile results
save "`results'", replace

* Create input for exercise
set seed 123456
set obs 500
g firmid = _n   /* Observation (firm) id */
g nw = floor(100*runiform())      /* Number of workers in a firm */
g double lat = 39+runiform()      /* Latitude in decimal degree of a firm */
g double lon = -76+runiform()     /* Longitude in decimal degree of a firm */
gen dum = 1
*list

* Save number of firms
local size = _N
display "`size'"

* joinby procedure
tempfile main
save "`main'"

timer clear 1
timer clear 2
timer clear 3
timer clear 4

quietly {
    timer on 1
    forvalues i=1/`size'{
        timer on 2
        use "`main'" in `i', clear // assumed sorted on firmid
        rename (firmid lat lon nw) =0

        joinby dum using "`main'", unmatched(using)
        drop _merge dum
        order firmid0 firmid
        timer off 2

        timer on 3
        vincenty lat0 lon0 lat lon, hav(dist) inkm
        timer off 3
        keep if dist <= 40 // an arbitrary distance

        timer on 4
        collapse (sum) nw_sum=nw (mean) nw0 lat0 lon0, by(firmid0)

        append using "`results'"
        save "`results'", replace
        timer off 4
    }
    timer off 1
}

use "`results'", clear
sort firmid0
drop x
list

timer list

However inefficicent, some testing using timer shows that most of the computation time goes into the vincenty command which you won't be able to escape. The following is the time (in seconds) for 10,000 observations with an Intel Core i5 processor and a conventional hard drive (not SSD). Timer 1 is the total while 2, 3, 4 are the components (approx.). Timer 3 corresponds to vincenty:

. timer list
   1:   1953.99 /        1 =    1953.9940
   2:    169.19 /    10000 =       0.0169
   3:   1669.95 /    10000 =       0.1670
   4:     94.47 /    10000 =       0.0094

Of course, note that in both codes duplicate computations of distances are made (e.g. both the distances between firm1-firm2 and firm2-firm1 are computed) and this you can probably avoid. As it stands, for 110,000 observations it will take a long time. On the positive side, I noticed this second setup demands very little RAM as compared to the same amount of observations in the first setup. In fact, my 4GB machine freezes with the latter.

Also note that even though I use the same seed as you do, data is different because I create different numbers of observations (not 5000), which makes a difference in the variable creation process.

(By the way, if you wanted to save the value as a scalar you could use subscripting: scalar latitude = lat[1]).

Stata: Extracting values and save them as scalars (and more)

1 Answers