I have multiple observations per person (panel data). For a certain kind of plot, however, I only need one row per person, and several of my characteristics do not change within a person at all.
Here is the data and a basic analysis:
library(dplyr)
library(ggplot2)

set.seed(1234)
data <- data.frame(id = rep(1:25, each = 4),      # 25 persons, 4 rows each
                   time = 1:4,                    # recycled: waves 1-4 per person
                   char1 = sample(0:10, 100, replace = TRUE),
                   char2 = sample(0:5, 100, replace = TRUE),
                   yob = rep(sample(1910:2010, 25, replace = TRUE), each = 4))

data <- data %>%
  group_by(yob) %>%                                    # year of birth
  mutate(char1_share = sum(char1 == 1) / sum(char1),   # first characteristic
         char2_share = sum(char2 == 1) / sum(char2))   # second characteristic

ggplot(data = data, aes(x = yob)) +
  geom_smooth(aes(y = char1_share, color = "char1")) +
  geom_smooth(aes(y = char2_share, color = "char2"))
When I use all 100 observations, these relationships are estimated incorrectly, since each person is counted multiple times (in my real data a person appears a random number of times, e.g. 1 to 5 rows). I know how to get the unique persons:
data %>% distinct(id)
but I cannot combine this with my group-by/mutate structure.
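For illustration, n_distinct() does drop into the grouped mutate, but that only yields a count of unique persons per year of birth, still with one row per observation (a minimal sketch; the n_people name is just a placeholder):

data %>%
  group_by(yob) %>%
  mutate(n_people = n_distinct(id))   # counts unique persons per yob, does not collapse rows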
Data should be collapsed in the following way:
head(data, n=8L)
# A tibble: 8 x 7
# Groups: yob [2]
id time char1 char2 yob char1_share char2_share
<int> <int> <int> <int> <int> <dbl> <dbl>
1 1 1 9 2 1942 0.0222 0.05
2 1 2 5 3 1942 0.0222 0.05
3 1 3 4 3 1942 0.0222 0.05
4 1 4 8 2 1942 0.0222 0.05
5 2 1 4 1 1970 0.0769 0.143
6 2 2 5 3 1970 0.0769 0.143
7 2 3 3 4 1970 0.0769 0.143
8 2 4 1 1 1970 0.0769 0.143
- Collapse all information that does not change within a person, i.e. id, yob, and the shares (char1_share, char2_share), to a single row per person.
- Drop the columns that are not important for this particular analysis, i.e. time. I can do this with select in dplyr.
- For the columns with several measurements per person (char1 and char2), pick the highest value ever attained: for person 1, char1 should equal 9 and char2 should equal 3; for person 2, char1 should equal 5 and char2 should equal 4; and so on. (One possible sketch follows the expected output below.)
I expect the following tibble:
# A tibble: 2 x 6
# Groups:   yob [2]
     id char1 char2   yob char1_share char2_share
  <int> <int> <int> <int>       <dbl>       <dbl>
1     1     9     3  1942      0.0222      0.05
2     2     5     4  1970      0.0769      0.143
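One direction I am considering (not sure it is idiomatic; constants, maxima and collapsed are just placeholder names): keep the person-constant columns once via distinct() and join them to a per-person summary of the maxima.

constants <- data %>%
  ungroup() %>%
  distinct(id, yob, char1_share, char2_share)   # one row per person, constant columns only

maxima <- data %>%
  group_by(id) %>%
  summarise(char1 = max(char1),                 # highest value ever attained per person
            char2 = max(char2))

collapsed <- left_join(maxima, constants, by = "id")
head(collapsed, n = 2L)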
Thank you.
Update: I tried several ideas with top_n and slice. It's always pretty close, e.g.:
data %>%
select(-time) %>%
group_by(id) %>%
slice(which.max(char1))   # per id, keep only the row where char1 is largest
In this version I would need to extend slice(which.max(char1)) so that it works column by column, i.e. takes the maximum of char1 and char2 separately (their maxima can sit in different rows).
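If I switch from slicing one row to summarising, one way to go column by column (assuming dplyr >= 1.0 for across(); summarise_at() would play the same role in older versions) might be:

data %>%
  select(-time) %>%
  group_by(id) %>%
  summarise(across(c(char1, char2), max),                      # per-column maximum per person
            across(c(yob, char1_share, char2_share), first))   # constant within a person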