1
votes

I'm working with a large time series dataset. I have multiple individuals (id) that were assayed on an hourly basis (hour) over many days (dates). However, not all individuals were observed on the same dates. I would like to create a new variable (obs) that lists the dates of each individuals from 1:n, so that each hourly assay in the same day gets the same number.

I thought I could do this easily in dplyr by using the group_by(id, date) and mutate to count the length of each id's dates, but this just replicates the 'hour' variable which I don't want.

# what i have

    id <- rep(c("id1", "id2"), each = 6)
    date <- as.Date(rep(c("2018-3-13", "2018-3-14", "2018-4-11", "2018-4-12"), each = 3))
    hour <- rep(1:3, 4)
    data.have <- data.frame(id, date, hour)

# attempt 1 - just replicates 'hour' which I don't want

    data.have %>%
      group_by(id, date) %>%
      arrange(date) %>%
      mutate(obs = 1:length(date))

# what i want

    obs <- rep(1:2, each =3, times = 2)
    data.want <- data.frame(id, date, hour, obs)

´´´
3
Speaking as someone with experience in clinical data, I would strongly advise you to introduce a relative observation date variable, i.e. number of days from day zero. Not only because of your current problem. If you did that, your task would be trivial. data.have %>% group_by(id) %>% arrange(date) %>% mutate(reldate=date - date[1])January

3 Answers

1
votes

You can do it using dense_rank and group only by ID

data.have %>% 
    group_by(id) %>% 
    mutate(obs=dense_rank(date))
0
votes

A slightly cheat-y solution would be to group by id then convert date to a factor and back to a numeric:

data.have %>%
  group_by(id) %>% 
  mutate(obs = as.numeric(as.factor(date)))
0
votes

I would take another approach for reasons which are not technical: I would introduce a relative date, number of days since the first observation:

data.have %>% group_by(id) %>% arrange(date) %>% mutate(reldate=date - date[1])

There are several reasons for that, but the most important one is that in my experience, such clinical data will need a relative date for the purposes of analysis sooner or later, so better generate it sooner than later.

Actually, this is probably the value you desire.

data.want <- data.have %>% group_by(id) %>% 
             arrange(date) %>% 
             mutate(reldate=date - date[1]) %>% 
             mutate(times=as.numeric(reldate + 1))

Assuming, of course, that the measurements were taken daily. If not, you can create the times variable by converting the reldate to factor and taking it as numeric vector.