How to delete 0.1 percent of last observations

Question

I have the variables name and price. I want to delete the last 0.1 percent of observations of price for each name.

The last observations are those with the highest price. There are no missing values. If a distinct name has fewer than 1000 observations, just the final observation should be deleted.

How can I do that using Stata?

What are the last observations? Latest in time? Highest on price? What do you want to do about missing values? Do you have at least 1000 observations for each distinct name? If not, what happens? Too vague, especially without any example data or attempts at code. — Nick Cox
What are the last observations? Highest on price. What do you want to do about missing values? There are no missing values. — Amin Karimi
Please edit your question, including answers to my other queries. — Nick Cox

Nick Cox · Accepted Answer · 2018-04-03T17:38:36

I wouldn't delete (taken to mean drop) observations just because they are not needed for some purpose. Given distinct values of name, a new variable ntoflag

bysort name (price) : gen ntoflag = floor(_N / 1000)

will automatically be 0 if the number of observations for that name is less than 1000, and so considering the complementary definition

by name: gen long ntokeep = _N - floor(_N/1000)

leads to

bysort name (price) : gen flag = _n > (_N - floor(_N/1000))

as a one-line solution for an indicator for observations to ignore. (Its negation is an indicator for observations to use.)
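
As a minimal sketch of how such an indicator might be used, assuming the variables name and price from the question: later commands can be restricted to the unflagged observations, so nothing has to be dropped. The names flagtop and flagtop1 are hypothetical, chosen to avoid clashing with flag above. The max() variant is an assumption based on the question's wording: it flags at least the single highest price for every name, including names with fewer than 1000 observations.

```stata
* flagtop is 1 for the top floor(_N/1000) prices within each name, 0 otherwise
bysort name (price) : gen byte flagtop = _n > (_N - floor(_N/1000))

* use the indicator without dropping anything
summarize price if !flagtop

* variant (assumption from the question): always flag at least
* the highest price per name, even when _N < 1000
bysort name (price) : gen byte flagtop1 = _n > (_N - max(1, floor(_N/1000)))

* check how many observations each rule would set aside
count if flagtop
count if flagtop1

* only if deletion really is wanted:
* drop if flagtop1
```

Keeping the indicator and qualifying commands with if !flagtop leaves the original data intact, so the rule can be changed later without reloading the dataset.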

However, here is a thought experiment. Suppose you have 1000 prices and the top 7 prices are all 999. So, you want to ignore 0.1% = 1/1000. Which of those 7 do you want to ignore? Now consider that there may be different values for other variables in the same observations. In short, you need an explicit consistent methodology for ties.
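
Two possible conventions for ties can be sketched as follows, using the auto dataset so that the code is reproducible. Neither is prescribed by the answer; both are illustrations under stated assumptions. The first breaks ties on price with a second sort key (make, which is unique in auto), so the same observations are flagged on every run. The second flags every observation tied with the lowest flagged price, so the flagged share can exceed the nominal threshold. The variable names flag_a, flag_b and cut are hypothetical.

```stata
sysuse auto, clear

* option 1: break ties on price with a second sort key, so that
* the sort order within foreign is fully determined
bysort foreign (price make) : gen byte flag_a = _n > (_N - floor(0.05 * _N))

* option 2: flag by position first, then extend the flag to every
* observation tied with the lowest flagged price in its group
bysort foreign (price) : gen byte flag_b = _n > (_N - floor(0.05 * _N))
bysort foreign : egen double cut = min(cond(flag_b, price, .))
replace flag_b = 1 if price == cut

* compare the two conventions
tabulate flag_a flag_b
```

Option 2 assumes there are no missing prices: if a group has no flagged observations, cut is missing, and a missing price would then compare equal to it.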

To show how this works, here is a reproducible experiment for any Stata users with a much smaller dataset and a threshold of the top 5% by car origin.

. sysuse auto, clear
(1978 Automobile Data)

. bysort foreign (price) : gen flag = _n > (_N - floor(0.05 * _N))

. list foreign price if flag

     +-------------------+
     |  foreign    price |
     |-------------------|
 51. | Domestic   14,500 |
 52. | Domestic   15,906 |
 74. |  Foreign   12,990 |
     +-------------------+

. bysort foreign : su price

----------------------------------------------------------------------------------
-> foreign = Domestic

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         52    6072.423    3097.104       3291      15906

----------------------------------------------------------------------------------
-> foreign = Foreign

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       price |         22    6384.682    2621.915       3748      12990
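
The number of flagged observations in the listing can be checked by hand: with 52 domestic and 22 foreign cars, floor(0.05 * 52) = 2 and floor(0.05 * 22) = 1. A short sketch of that check, together with summaries that ignore the flagged observations, might look like this (it recreates flag from scratch, so it assumes the dataset is reloaded first):

```stata
sysuse auto, clear
bysort foreign (price) : gen byte flag = _n > (_N - floor(0.05 * _N))

* number of observations the rule should flag in each group
display floor(0.05 * 52)
display floor(0.05 * 22)

* cross-tabulate to confirm the counts per group
tabulate foreign flag

* summaries restricted to the observations to use
bysort foreign : summarize price if !flag
```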
