1
votes

I have two columns with data.

One has labels for a group and a second displays values for items in each group. I would like to calculate for each group, the average of only those values that are distinct.

How can I do this in Stata?


EDIT:

See my dataset and desired result below:

Group_label   Value
   x            12
   x            12
   x            2
   x            1
   y            5
   y            5
   y            5
   y            2
   y            2

I want to generate the following average:

Group_label   Value      Average
   x            12         5
   x            12         5
   x            2          5
   x            1          5
   y            5          3.5
   y            5          3.5
   y            5          3.5
   y            2          3.5
   y            2          3.5

So the average for x = (12 + 2 + 1) / 3 and for y = (5 + 2) / 2

I have tried the egen(mean) command but it selects all values for each group label.

I only want to select the distinct values.

2

2 Answers

2
votes

This is a two-step solution. You first need to tag distinct values using tag() within egen. Then you use mean() within egen.

The most delicate point is that something like ... if tag will leave missing values in the result for observations not selected. How can you omit duplicated values from the calculation yet also spread the result to their observations? See Section 9 of this paper for the use of cond() together with mean() which is one way to do it, exemplified in the code, and perhaps the most transparent way too. See Section 10 of the same paper for another method, which amuses some people.

For a fairly detailed review of distinct observations, see https://www.stata-journal.com/sjpdf.html?articlenum=dm0042

clear 
input str1 Group_label   Value
   x            12
   x            12
   x            2
   x            1
   y            5
   y            5
   y            5
   y            2
   y            2
  end 

  egen tag = tag(Group_label Value)

  egen mean = mean(cond(tag, Value, .)), by(Group_label)

  list, sepby(Group_label)

     +-------------------------------+
     | Group_~l   Value   tag   mean |
     |-------------------------------|
  1. |        x      12     1      5 |
  2. |        x      12     0      5 |
  3. |        x       2     1      5 |
  4. |        x       1     1      5 |
     |-------------------------------|
  5. |        y       5     1    3.5 |
  6. |        y       5     0    3.5 |
  7. |        y       5     0    3.5 |
  8. |        y       2     1    3.5 |
  9. |        y       2     0    3.5 |
     +-------------------------------+
1
votes

The following works for me:

clear

input str1 vlab   val
   "x"            12
   "x"            12
   "x"            2
   "x"            1
   "y"            5
   "y"            5
   "y"            5
   "y"            2
   "y"            2
end

bysort vlab: generate tag = val != val[_n-1]
bysort vlab: egen mean_val = mean(val) if tag == 1

list

     +-----------------------------+
     | vlab   val   tag   mean_val |
     |-----------------------------|
  1. |    x    12     1          5 |
  2. |    x    12     0          . |
  3. |    x     2     1          5 |
  4. |    x     1     1          5 |
  5. |    y     5     1        3.5 |
     |-----------------------------|
  6. |    y     5     0          . |
  7. |    y     5     0          . |
  8. |    y     2     1        3.5 |
  9. |    y     2     0          . |
     +-----------------------------+

EDIT:

If you also do:

bysort vlab: replace mean_val = mean_val[_n-1] if mean_val == .

You will get:

list

     +-----------------------------+
     | vlab   val   tag   mean_val |
     |-----------------------------|
  1. |    x    12     1          5 |
  2. |    x    12     0          5 |
  3. |    x     2     1          5 |
  4. |    x     1     1          5 |
  5. |    y     5     1        3.5 |
     |-----------------------------|
  6. |    y     5     0        3.5 |
  7. |    y     5     0        3.5 |
  8. |    y     2     1        3.5 |
  9. |    y     2     0        3.5 |
     +-----------------------------+