I wouldn't delete (taken to mean drop) observations just because they are not needed for some purpose. Given distinct values of name, a new variable ntoflag (the number of observations to flag within each name)
bysort name (price) : gen ntoflag = floor(_N / 1000)
will automatically be 0 for any value of name with fewer than 1000 observations, and so considering the complementary definition
by name: gen long ntokeep = _N - floor(_N/1000)
leads to
bysort name (price) : gen flag = _n > (_N - floor(_N/1000))
as a one-line solution for an indicator for observations to ignore. (Its negation is an indicator for observations to use.)
However, here is a thought experiment. Suppose you have 1000 prices and the top 7 prices are all 999. You want to ignore the top 0.1%, which here is 1 observation out of 1000. Which of those 7 do you want to ignore? Now consider that there may be different values for other variables in the same observations. In short, you need an explicit, consistent methodology for ties.
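As a rough sketch of what such a methodology could look like, the code below builds toy data matching the thought experiment and applies two possible tie conventions. Neither is presented as the right answer. The variable names cutoff, flag_ties and flag_strict are hypothetical, and the sketch assumes price has no missing values (missing values sort last in Stata and would otherwise be the first to be flagged).

```stata
* Toy data matching the thought experiment: 1000 prices, top 7 all equal to 999.
* The variable names cutoff, flag_ties and flag_strict are hypothetical.
clear
set obs 1000
gen str1 name = "a"
gen double price = _n
replace price = 999 if _n > 993

* Number of observations the one-line rule would flag in each group (here 1).
bysort name (price) : gen long ntoflag = floor(_N / 1000)

* Convention A: flag every observation tied with the smallest flagged price.
* This can flag more than the nominal share.
by name : gen double cutoff = price[_N - ntoflag + 1] if ntoflag > 0
gen byte flag_ties = price >= cutoff & !missing(price)

* Convention B: flag only prices strictly above the largest price that is kept.
* This can flag fewer than the nominal share.
by name : gen byte flag_strict = price > price[_N - ntoflag]

count if flag_ties
count if flag_strict
```

On this toy data flag_ties marks all 7 tied observations and flag_strict marks none, so the two conventions bracket the nominal 1 observation. Which is preferable depends on the purpose; breaking ties on some other variable is a third possibility.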
To show how this works, here is a reproducible experiment for any Stata users with a much smaller dataset and a threshold of the top 5% by car origin.
. sysuse auto, clear
(1978 Automobile Data)
. bysort foreign (price) : gen flag = _n > (_N - floor(0.05 * _N))
. list foreign price if flag
     +-------------------+
     |  foreign    price |
     |-------------------|
 51. | Domestic   14,500 |
 52. | Domestic   15,906 |
 74. |  Foreign   12,990 |
     +-------------------+
. bysort foreign : su price
----------------------------------------------------------------------------------
-> foreign = Domestic
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         52    6072.423    3097.104       3291      15906
----------------------------------------------------------------------------------
-> foreign = Foreign
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         22    6384.682    2621.915       3748      12990
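Continuing the auto example, here is a minimal sketch of using the indicator rather than dropping anything. It repeats the flag definition so that it runs on its own, and the choice of summarize is only an illustration; any command that accepts an if qualifier could be restricted the same way.

```stata
sysuse auto, clear
bysort foreign (price) : gen byte flag = _n > (_N - floor(0.05 * _N))
* Nothing is dropped: the negation of flag restricts each command instead.
bysort foreign : summarize price if !flag
* The flagged observations remain available for inspection.
count if flag
```

The final count reports the 3 flagged observations listed above, and the full dataset remains in memory for any analysis that does need them.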
name? If not, what happens? Too vague, especially without any example data or attempts at code. – Nick Cox