Dataset description:
I have a highly unbalanced panel dataset, with some unique panelist IDs appearing only once, while others appear as much as 4,900 times. Each observation reflects an alcohol purchase associated with a unique product identifier (UPC). If my panelist purchased two separate brands (hence, two different UPCs) in the same day, same store, two distinct observations are created. However, seeing that these purchases were made on the same day and same store, I could safely assume that it was just one trip. Similarly, another panelist who also has 2 observations associated with the same store BUT different days of purchase (or vice versa) is assumed to make 2 store visits.
Task:
I would like to explore qualities of those people who purchased alcohol a certain number of times in the whole period. Thus, I need to identify panelists who made only 1) 1 visit, 2) 2 visits, 3) between 5 and 10 visits, 4) between 50 and 100 visits, etc.
I started by trying to identify panelists who made only 1 visit by tagging them by panelist id, day, and store. However, the program also tags the first occurrence of those who appear twice or more.
egen tag = tag(panid day store)
I also tried collapse
but realized that it might not be the best solution because I want to keep my observations "as is" without aggregating any variables.
I will appreciate if you can provide me insight on how to identify such observations.
UPDATE:
panid units dollars iri_key upc day tag
1100560 1 5.989 234140 00-01-18200-00834 47 1
1101253 1 13.99 652159 00-03-71990-09516 251 1
1100685 1 20.99 652159 00-01-18200-53030 18 1
1100685 1 15.99 652159 00-01-83783-37512 18 0
1101162 1 19.99 652159 00-01-34100-15341 206 1
1101162 1 19.99 652159 00-01-34100-15341 235 1
1101758 1 12.99 652159 00-01-18200-43381 30 1
1101758 1 6.989 652159 00-01-18200-16992 114 1
1101758 1 11.99 652159 00-02-72311-23012 121 1
1101758 2 21.98 652159 00-02-72311-23012 128 1
1101758 1 19.99 652159 00-01-18200-96550 223 1
1101758 1 12.99 234140 00-04-87692-29103 247 1
1101758 1 20.99 234140 00-01-18200-96550 296 1
1101758 1 12.99 234140 00-01-87692-11103 296 0
1101758 1 12.99 652159 00-01-87692-11103 317 1
1101758 1 19.99 652159 00-01-18200-96550 324 1
1101758 1 12.99 652159 00-02-87692-68103 352 1
1101758 1 12.99 652159 00-01-87692-32012 354 1
Hi Roberto, thanks for the feedback. This is a small sample of the dataset. In the first part of this particular example, we can safely assume that all three ids 1100560, 1101253, and 1100685 visited a store only once, i.e. made only one transaction each. The first two panelists obviously have only one record each, and the third panelist purchased 2 different UPCs in the same store, same day, i.e. in the same transaction.
The second part of the example has two panelists - 1101162 and 1101758 - who made more than one transaction: two and eleven, respectively. (Panelist 1101758 has 12 observations, but only 11 distinct trips.)
I would like to identify an exact number of distinct trips (or transactions) panelists of my dataset made:
panid units dollars iri_key upc day tag total#oftrips
1100560 1 5.989 234140 00-01-18200-00834 47 1 1
1101253 1 13.99 652159 00-03-71990-09516 251 1 1
1100685 1 20.99 652159 00-01-18200-53030 18 1 1
1100685 1 15.99 652159 00-01-83783-37512 18 0 1
1101162 1 19.99 652159 00-01-34100-15341 206 1 2
1101162 1 19.99 652159 00-01-34100-15341 235 1 2
1101758 1 12.99 652159 00-01-18200-43381 30 1 11
1101758 1 6.989 652159 00-01-18200-16992 114 1 11
1101758 1 11.99 652159 00-02-72311-23012 121 1 11
1101758 2 21.98 652159 00-02-72311-23012 128 1 11
1101758 1 19.99 652159 00-01-18200-96550 223 1 11
1101758 1 12.99 234140 00-04-87692-29103 247 1 11
1101758 1 20.99 234140 00-01-18200-96550 296 1 11
1101758 1 12.99 234140 00-01-87692-11103 296 0 11
1101758 1 12.99 652159 00-01-87692-11103 317 1 11
1101758 1 19.99 652159 00-01-18200-96550 324 1 11
1101758 1 12.99 652159 00-02-87692-68103 352 1 11
1101758 1 12.99 652159 00-01-87692-32012 354 1 11
Bottom line, I guess, is - as long as panelist
, iri_key
, and day
are the same, this would count as 1 trip. The total number of trips per panelists will depend on an additional number of distinct panelist
, iri_key
, and day
combinations.