
This has probably already been answered, but I must be searching for the wrong terms. Suppose I am using the built-in Stata dataset auto:

sysuse auto, clear

Say, for example, that I am working with one independent and one dependent variable, and I want to reduce the data to the five-number summary of the independent variable: min, p25, median, p75, max. I use the commands:

keep weight mpg

sum weight, detail

return list

local min=r(min)

local lqr=r(p25)

local med = r(p50)

local uqr = r(p75)

local max = r(max)

keep if weight==`min' | weight==`max' | weight==`med' | weight==`lqr' | weight==`uqr'

That is, I want to reduce the dataset to only those 5 observations. In this case, however, the median is not actually an element of the weight vector: there is an observation just above it and one just below (no surprise, given how the median is defined for an even number of observations). Is there a way to tell Stata to look for the nearest neighbor above the percentile? That is, if r(p50) is not an element of weight, search above that value for the next observation. The end result I am after is two vectors, say weight and mpg, such that each of the 5 selected weight values has its matching response in mpg. Any thoughts?

Essentially the same question was cross-posted at statalist.org/forums/forum/general-stata-discussion/general/… See that thread for several comments and suggestions. – Nick Cox

2 Answers


I think you want something like:

clear all
set more off

sysuse auto
keep weight mpg

summarize weight, detail

local min = r(min)
local lqr = r(p25)
local med = r(p50)
local uqr = r(p75)
local max = r(max)

* absolute difference between each weight and the median
gen diff = abs(weight - `med')

* put the smallest difference in observation 1 (there can be several, watch out!)
isid diff weight mpg, sort

* replace the original median with the weight "closest" to the median
local med = weight[1]

keep if inlist(weight, `min', `lqr', `med', `uqr', `max')
drop diff

* pretty print
order weight mpg
sort weight mpg
list, sep(0)

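Notice that the median itself does not appear, because we kept its "closest" neighbor instead (weight == 3,180). Since the sort breaks ties in diff by weight, the lower of two equally close neighbors is the one kept. Also, the 75th percentile has two associated mpg values.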

You could probably work something out with collapse and merge (among other approaches), but I'll leave it at this.

Use help <command> for whatever is not clear.
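
For the "nearest neighbor above" variant asked about in the question, one option is to look up, for each percentile, the smallest observed weight at or above it. This is a minimal sketch, assuming weight is integer-valued as in auto (so the percentile values pass through local macros unchanged). The variable name pick is only illustrative.

```stata
sysuse auto, clear
keep weight mpg

summarize weight, detail
* store the five targets before r() is overwritten inside the loop
local targets `r(min)' `r(p25)' `r(p50)' `r(p75)' `r(max)'

* flag, for each target, the smallest observed weight at or above it
gen byte pick = 0
foreach t of local targets {
    summarize weight if weight >= `t', meanonly
    replace pick = 1 if weight == r(min)
}

keep if pick
drop pick
sort weight mpg
list, sep(0)
```

When a percentile is itself an observed value, that observation is kept as is; only a percentile that falls between two observations (here the median) moves up to the next one. Tied weights are all kept, so the result can have more than 5 rows.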


Thank you for all the suggestions; here is what I came up with. The idea is that I was pulling these 5 numbers so I could send them to Mata for a cubic spline routine that I am attempting to write. For whatever reason, trying to generalize this was giving me a headache.

My final solution:

    sysuse auto, clear
    preserve
    sort weight
    count if weight<.
    keep if _n==1 | _n==ceil(r(N)/4) | _n==ceil(r(N)/2) | _n==ceil(3*r(N)/4) | _n==r(N)
    gen X = weight
    gen Y = mpg
    list X Y 
    /* at this point I will send X and Y to mata for the cubic spline 
    routine that I am in the process of writing. It was this little step that 
    was bugging me. */

    restore
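
One way to hand the two columns to Mata, sketched on the assumption that the reduced dataset is still in memory (that is, before restore). st_data() copies the named variables into Mata matrices; X and Y below are just Mata names chosen to match the description above, and the local n is introduced only for readability.

```stata
sysuse auto, clear
preserve
sort weight mpg
count if weight < .
local n = r(N)
* keep the order statistics at positions 1, ceil(n/4), ceil(n/2), ceil(3n/4), n
keep if inlist(_n, 1, ceil(`n'/4), ceil(`n'/2), ceil(3*`n'/4), `n')

mata:
X = st_data(., "weight")   // column vector of the selected weights
Y = st_data(., "mpg")      // matching mpg values, same row order
X, Y
end
restore
```

With an even number of non-missing observations, ceil(n/2) selects the lower of the two middle values, so the median pick sits just below r(p50) rather than above it; using floor(n/2) + 1 instead would take the upper one.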