I have multiple observations per person (panel data). For a certain kind of plot, however, I only need one row per person, and several of my characteristics do not change within a person at all.
Here is the data and a basic analysis:
library(dplyr)
library(ggplot2)

set.seed(1234)
data <- data.frame(id = rep(1:25, each = 4),      # 25 persons, 4 rows each
                   time = 1:4,                    # recycled: waves 1-4 per person
                   char1 = sample(0:10, 100, replace = TRUE),
                   char2 = sample(0:5, 100, replace = TRUE),
                   yob = rep(sample(1910:2010, 25, replace = TRUE), each = 4))

data <- data %>%
  group_by(yob) %>%                                    # year of birth
  mutate(char1_share = sum(char1 == 1) / sum(char1),   # first characteristic
         char2_share = sum(char2 == 1) / sum(char2))   # second characteristic

ggplot(data = data, aes(x = yob)) +
  geom_smooth(aes(y = char1_share, color = "char1")) +
  geom_smooth(aes(y = char2_share, color = "char2"))
When I use all 100 observations, these relationships are estimated incorrectly, since each person is counted multiple times (in my real data a person appears a random number of times, e.g. 1 to 5 rows). I know how to get the unique persons:
data %>% distinct(id)
but I cannot combine this with my group-by/mutate structure.
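For illustration, n_distinct() does drop into the grouped mutate, but that only yields a count of unique persons per year of birth, still with one row per observation (a minimal sketch; the n_people name is just a placeholder):

data %>%
  group_by(yob) %>%
  mutate(n_people = n_distinct(id))   # counts unique persons per yob, does not collapse rows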
Data should be collapsed in the following way:
head(data, n=8L)
# A tibble: 8 x 7
# Groups: yob [2]
id time char1 char2 yob char1_share char2_share
<int> <int> <int> <int> <int> <dbl> <dbl>
1 1 1 9 2 1942 0.0222 0.05
2 1 2 5 3 1942 0.0222 0.05
3 1 3 4 3 1942 0.0222 0.05
4 1 4 8 2 1942 0.0222 0.05
5 2 1 4 1 1970 0.0769 0.143
6 2 2 5 3 1970 0.0769 0.143
7 2 3 3 4 1970 0.0769 0.143
8 2 4 1 1 1970 0.0769 0.143
- Collapse all information that does not change within a person, i.e. id, yob, and the shares (char1_share, char2_share), to a single row per person.
- Drop the columns that are not important for this particular analysis, i.e. time. I can do this with select in dplyr.
- For the columns with several measurements per person (char1 and char2), pick the highest value ever attained: for person 1, char1 should equal 9 and char2 should equal 3; for person 2, char1 should equal 5 and char2 should equal 4; and so on. (One possible sketch follows the expected output below.)
I expect the following tibble:
# A tibble: 2 x 6
# Groups:   yob [2]
     id char1 char2   yob char1_share char2_share
  <int> <int> <int> <int>       <dbl>       <dbl>
1     1     9     3  1942      0.0222      0.05
2     2     5     4  1970      0.0769      0.143
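One direction I am considering (not sure it is idiomatic; constants, maxima and collapsed are just placeholder names): keep the person-constant columns once via distinct() and join them to a per-person summary of the maxima.

constants <- data %>%
  ungroup() %>%
  distinct(id, yob, char1_share, char2_share)   # one row per person, constant columns only

maxima <- data %>%
  group_by(id) %>%
  summarise(char1 = max(char1),                 # highest value ever attained per person
            char2 = max(char2))

collapsed <- left_join(maxima, constants, by = "id")
head(collapsed, n = 2L)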
Thank you.
Update: I tried several ideas with top_n and slice. It's always pretty close, e.g.:
data %>%
select(-time) %>%
group_by(id) %>%
slice(which.max(char1))   # per id, keep only the row where char1 is largest
In this version I would need to extend slice(which.max(char1)) so that it works column by column, i.e. takes the maximum of char1 and char2 separately (their maxima can sit in different rows).
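If I switch from slicing one row to summarising, one way to go column by column (assuming dplyr >= 1.0 for across(); summarise_at() would play the same role in older versions) might be:

data %>%
  select(-time) %>%
  group_by(id) %>%
  summarise(across(c(char1, char2), max),                      # per-column maximum per person
            across(c(yob, char1_share, char2_share), first))   # constant within a person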