Keep unique IDs depending on frequency of occurrence

Question

Dataset description:

I have a highly unbalanced panel dataset, with some unique panelist IDs appearing only once, while others appear as much as 4,900 times. Each observation reflects an alcohol purchase associated with a unique product identifier (UPC). If my panelist purchased two separate brands (hence, two different UPCs) in the same day, same store, two distinct observations are created. However, seeing that these purchases were made on the same day and same store, I could safely assume that it was just one trip. Similarly, another panelist who also has 2 observations associated with the same store BUT different days of purchase (or vice versa) is assumed to make 2 store visits.

Task:

I would like to explore qualities of those people who purchased alcohol a certain number of times in the whole period. Thus, I need to identify panelists who made only 1) 1 visit, 2) 2 visits, 3) between 5 and 10 visits, 4) between 50 and 100 visits, etc.

I started by trying to identify panelists who made only 1 visit by tagging them by panelist id, day, and store. However, the program also tags the first occurrence of those who appear twice or more.

egen tag = tag(panid day store)

I also tried collapse but realized that it might not be the best solution because I want to keep my observations "as is" without aggregating any variables.

I will appreciate if you can provide me insight on how to identify such observations.

UPDATE:

panid   units dollars iri_key   upc                 day tag
1100560 1     5.989   234140    00-01-18200-00834   47  1
1101253 1     13.99   652159    00-03-71990-09516   251 1
1100685 1     20.99   652159    00-01-18200-53030   18  1
1100685 1     15.99   652159    00-01-83783-37512   18  0


1101162 1     19.99   652159    00-01-34100-15341   206 1  
1101162 1     19.99   652159    00-01-34100-15341   235 1
1101758 1     12.99   652159    00-01-18200-43381   30  1
1101758 1     6.989   652159    00-01-18200-16992   114 1
1101758 1     11.99   652159    00-02-72311-23012   121 1
1101758 2     21.98   652159    00-02-72311-23012   128 1
1101758 1     19.99   652159    00-01-18200-96550   223 1
1101758 1     12.99   234140    00-04-87692-29103   247 1
1101758 1     20.99   234140    00-01-18200-96550   296 1
1101758 1     12.99   234140    00-01-87692-11103   296 0
1101758 1     12.99   652159    00-01-87692-11103   317 1
1101758 1     19.99   652159    00-01-18200-96550   324 1
1101758 1     12.99   652159    00-02-87692-68103   352 1
1101758 1     12.99   652159    00-01-87692-32012   354 1

Hi Roberto, thanks for the feedback. This is a small sample of the dataset. In the first part of this particular example, we can safely assume that all three ids 1100560, 1101253, and 1100685 visited a store only once, i.e. made only one transaction each. The first two panelists obviously have only one record each, and the third panelist purchased 2 different UPCs in the same store, same day, i.e. in the same transaction.

The second part of the example has two panelists - 1101162 and 1101758 - who made more than one transaction: two and eleven, respectively. (Panelist 1101758 has 12 observations, but only 11 distinct trips.)

I would like to identify an exact number of distinct trips (or transactions) panelists of my dataset made:

panid   units dollars iri_key   upc                 day tag total#oftrips
1100560 1     5.989   234140    00-01-18200-00834   47  1   1
1101253 1     13.99   652159    00-03-71990-09516   251 1   1
1100685 1     20.99   652159    00-01-18200-53030   18  1   1  
1100685 1     15.99   652159    00-01-83783-37512   18  0   1


1101162 1     19.99   652159    00-01-34100-15341   206 1   2
1101162 1     19.99   652159    00-01-34100-15341   235 1   2
1101758 1     12.99   652159    00-01-18200-43381   30  1   11
1101758 1     6.989   652159    00-01-18200-16992   114 1   11
1101758 1     11.99   652159    00-02-72311-23012   121 1   11
1101758 2     21.98   652159    00-02-72311-23012   128 1   11
1101758 1     19.99   652159    00-01-18200-96550   223 1   11
1101758 1     12.99   234140    00-04-87692-29103   247 1   11
1101758 1     20.99   234140    00-01-18200-96550   296 1   11
1101758 1     12.99   234140    00-01-87692-11103   296 0   11
1101758 1     12.99   652159    00-01-87692-11103   317 1   11
1101758 1     19.99   652159    00-01-18200-96550   324 1   11
1101758 1     12.99   652159    00-02-87692-68103   352 1   11
1101758 1     12.99   652159    00-01-87692-32012   354 1   11

Bottom line, I guess, is - as long as panelist, iri_key, and day are the same, this would count as 1 trip. The total number of trips per panelists will depend on an additional number of distinct panelist, iri_key, and day combinations.

I'm not sure I fully understand. Is 1 visit then, 1 trip to the store irrespective of how many products were bought? And do you want to see the visits each person made to each store? Adding an example of how you want your data to look like, along with an explanation, would be helpful. — Roberto Ferrer
Thanks Roberto, here is an update. I wrote it before seeing your new answer. — Olga
You're welcome. So my guess was not bad. Remember that answers merit some kind of noticeable feedback on behalf of the owner of the question. See for example, stackoverflow.com/help/someone-answers. — Roberto Ferrer

Roberto Ferrer Roberto Ferrer · Accepted Answer · 2014-08-10T02:33:19

I'm not sure I understand exactly what you want, but here's my guess:

clear all
set more off

*----- example data -----

input ///
id code day store
1 1 86 1
1 1 45 1
1 3 45 1
1 3 4 4
2 1 86 1
2 1 45 1
2 3 45 1
end

format day %td

list, sepby(id)

*----- what you want? -----

egen tag = tag(id day store)

bysort id: egen totvis = total(tag)
bysort id store: egen totvis2 = total(tag)

list, sepby(id)

which will result in:

     +--------------------------------------------------------+
     | id   code         day   store   tag   totvis   totvis2 |
     |--------------------------------------------------------|
  1. |  1      3   05jan1960       4     1        3         1 |
  2. |  1      1   15feb1960       1     1        3         2 |
  3. |  1      3   15feb1960       1     0        3         2 |
  4. |  1      1   27mar1960       1     1        3         2 |
     |--------------------------------------------------------|
  5. |  2      1   15feb1960       1     1        2         2 |
  6. |  2      3   15feb1960       1     0        2         2 |
  7. |  2      1   27mar1960       1     1        2         2 |
     +--------------------------------------------------------+

This means person 1 made a total of 3 visits (considering all stores), and of those, 1 was to store 4 and 2 to store 1. Person 2 made 2 visits, both to store 1.

Keep unique IDs depending on frequency of occurrence

1 Answers