1
votes

I would like to replace unique values with an index number using dplyr::mutate.

I am grouping by a couple of different variables to access the appropriate subset of my dataframe.

  df
       group start_time end_time
  1   group1        0.0     0.40
  2   group1        0.0     0.40
  3   group1        0.0     0.40
  4   group1        0.4     0.80
  5   group1        0.4     0.80
  6   group2        0.0     0.40
  7   group2        0.4     0.80
  8   group2        0.8     1.02

I group by 'group' and then by 'start_time'. A given group may have one, two, or three distinct start_times. I need to create a new variable, 'idx', that numbers each unique start_time within its group, but I can't work out how to do it.

  new_df <- df %>% 
    group_by(group, start_time) %>% 
    mutate(idx = row_number()) %>%
    as.data.frame

Creating a new variable using row_number() isn't right. It gives me:

  idx
  1
  2
  3
  1
  2
  1
  1
  1

But I want:

  idx
  1
  1
  1
  2
  2
  1
  2
  3

I thought of replacing each unique start_time within a group with a number, repeated across every row that shares that value.

3 Answers

5
votes

We can use match after grouping by 'group'

library(tidyverse)
df %>% 
   group_by(group) %>%
   mutate(idx = match(start_time, unique(start_time)))
# A tibble: 8 x 4
# Groups:   group [2]
#  group  start_time end_time   idx
#  <chr>       <dbl>    <dbl> <int>
#1 group1        0       0.4      1
#2 group1        0       0.4      1
#3 group1        0       0.4      1
#4 group1        0.4     0.8      2
#5 group1        0.4     0.8      2
#6 group2        0       0.4      1
#7 group2        0.4     0.8      2
#8 group2        0.8     1.02     3

Another option is group_indices, applied to each 'group' separately after splitting the data with group_split:

df %>% 
   group_split(group) %>%
   map_df(~ .x %>% 
                mutate(idx = group_indices(., start_time)))

NOTE: If 'idx' should be numbered across the whole data frame rather than within each 'group', remove the group_by step.

NOTE2: In the OP's example, both versions (with and without group_by) give the same output.
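
One detail that may matter on other data: match(x, unique(x)) numbers values in order of first appearance, not in sorted order. A minimal sketch, using a made-up vector (not the OP's data) in which the start times are unsorted:

```r
# Hypothetical start times that are NOT sorted (illustration only)
x <- c(0.8, 0.8, 0.0, 0.4, 0.0)

# Order of first appearance: 0.8 is seen first, so it gets index 1
by_appearance <- match(x, unique(x))
by_appearance
# [1] 1 1 2 3 2

# Sorted order: the smallest value gets index 1
by_sorted <- match(x, sort(unique(x)))
by_sorted
# [1] 3 3 1 2 1
```

In the OP's example the start times are already ascending within each group, so both orderings agree there.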

2
votes

We can actually do this easily using R's factor type. A factor variable is stored as integers that refer to a table of levels which holds the actual values. We can then use as.integer or as.numeric to convert from factor back to a number. When you do that, the levels table is lost and you're left with only the integers that would refer back to it; normally this is undesired (you want your actual values, not the encoded values) but in this case it's desirable since identical values will be encoded with the same number:

df <- structure(list(group = c("group1", "group1", "group1", "group1", 
"group1", "group2", "group2", "group2"), start_time = c(0, 0, 
0, 0.4, 0.4, 0, 0.4, 0.8), end_time = c(0.4, 0.4, 0.4, 0.8, 0.8, 
0.4, 0.8, 1.02)), class = "data.frame", row.names = c(NA, -8L
))

df %>%
    mutate(idx = as.integer(factor(start_time)))

   group start_time end_time idx
1 group1        0.0     0.40   1
2 group1        0.0     0.40   1
3 group1        0.0     0.40   1
4 group1        0.4     0.80   2
5 group1        0.4     0.80   2
6 group2        0.0     0.40   1
7 group2        0.4     0.80   2
8 group2        0.8     1.02   3

As an added benefit, this works just as well in base R:

df$idx <- as.integer(factor(df$start_time))
df
   group start_time end_time idx
1 group1        0.0     0.40   1
2 group1        0.0     0.40   1
3 group1        0.0     0.40   1
4 group1        0.4     0.80   2
5 group1        0.4     0.80   2
6 group2        0.0     0.40   1
7 group2        0.4     0.80   2
8 group2        0.8     1.02   3
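
Note that the factor call above is not grouped, so it numbers start_time across the whole data frame. That matches the desired output here only because every group starts at 0. If the index should restart within each group, a sketch along these lines may help; df2 is a made-up data frame (not the OP's) in which group2 starts at 0.4:

```r
library(dplyr)

# Hypothetical data: group2 does not start at 0
df2 <- data.frame(
  group      = c("group1", "group1", "group2", "group2"),
  start_time = c(0, 0.4, 0.4, 0.8)
)

# Levels are the sorted unique values, shared by all rows
levels(factor(df2$start_time))
# [1] "0"   "0.4" "0.8"

# Ungrouped: group2 starts at index 2, because 0.4 is the second level overall
ungrouped <- df2 %>%
  mutate(idx = as.integer(factor(start_time)))
ungrouped$idx
# [1] 1 2 2 3

# Grouped: the factor is rebuilt per group, so each group starts at 1
grouped <- df2 %>%
  group_by(group) %>%
  mutate(idx = as.integer(factor(start_time))) %>%
  ungroup()
grouped$idx
# [1] 1 2 1 2

# Base R equivalent of the grouped version, using ave()
# (ave() returns a numeric vector here, since start_time is numeric)
base_idx <- ave(df2$start_time, df2$group,
                FUN = function(x) as.integer(factor(x)))
base_idx
# [1] 1 2 1 2
```

Because factor levels are sorted, this numbers values by sorted order within each group rather than by order of appearance.
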
1
votes

Another option is data.table::frank (short for fast rank)

df %>% 
   group_by(group) %>%
   mutate(idx = data.table::frank(start_time, ties.method = 'dense'))

# # A tibble: 8 x 4
# # Groups:   group [2]
#   group  start_time end_time   idx
#   <chr>       <dbl>    <dbl> <int>
# 1 group1        0       0.4      1
# 2 group1        0       0.4      1
# 3 group1        0       0.4      1
# 4 group1        0.4     0.8      2
# 5 group1        0.4     0.8      2
# 6 group2        0       0.4      1
# 7 group2        0.4     0.8      2
# 8 group2        0.8     1.02     3
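
If adding data.table as a dependency is not desirable, dplyr's own dense_rank should give the same kind of dense ranking. A minimal sketch, rebuilding the example data frame from the question:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  group      = c(rep("group1", 5), rep("group2", 3)),
  start_time = c(0, 0, 0, 0.4, 0.4, 0, 0.4, 0.8),
  end_time   = c(0.4, 0.4, 0.4, 0.8, 0.8, 0.4, 0.8, 1.02)
)

# Dense rank of start_time within each group: ties share a rank, no gaps
res <- df %>%
  group_by(group) %>%
  mutate(idx = dense_rank(start_time)) %>%
  ungroup()

res$idx
# [1] 1 1 1 2 2 1 2 3
```

Like the frank version, this ranks by sorted value, so the smallest start_time in each group gets index 1 regardless of row order.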