Hashing every row of a tibble

Question

I am using the newly minted dplyr 1.0.0 and the digest package to generate a hash of every row in a tibble.

I am aware of "adding hash to each row using dplyr and digest in R", but I would like to use the revamped rowwise() in dplyr 1.0.0.

See the example below. Does anyone have any idea why it fails? I should be allowed to digest a row whose entries are of different types.

library(dplyr)
library(digest)

df <- tibble(
    student_id = letters[1:4],
    student_id2 = letters[9:12],
    test1 = 10:13, 
    test2 = 20:23, 
    test3 = 30:33, 
    test4 = 40:43
)

df
#> # A tibble: 4 x 6
#>   student_id student_id2 test1 test2 test3 test4
#>   <chr>      <chr>       <int> <int> <int> <int>
#> 1 a          i              10    20    30    40
#> 2 b          j              11    21    31    41
#> 3 c          k              12    22    32    42
#> 4 d          l              13    23    33    43

dd <- df %>%
    rowwise(student_id) %>%
    mutate(hash = digest(c_across(everything()))) %>%
    ungroup
#> Error: Problem with `mutate()` input `hash`.
#> ✖ Can't combine `student_id2` <character> and `test1` <integer>.
#> ℹ Input `hash` is `digest(c_across(everything()))`.
#> ℹ The error occured in row 1.

### but digest should not care too much about the type of the input

Created on 2020-06-04 by the reprex package (v0.3.0)

Do you need df %>% mutate(hash = pmap_chr(., ~ digest(c(...)))) — akrun

akrun · Accepted Answer · 2020-06-03T23:37:12

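The error comes from the mixed column types: c_across() combines the selected columns with vctrs::vec_c(), which refuses to combine character and integer values. One option is to convert all columns to a single type first and then apply rowwise()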

library(dplyr)
library(digest)
df %>%
    mutate(across(everything(), as.character)) %>%
    rowwise() %>%
    mutate(hash = digest(c_across(everything())))
# A tibble: 4 x 7
# Rowwise: 
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <chr> <chr> <chr> <chr> <chr>                           
#1 a          i           10    20    30    40    2638067de6dcfb3d58b83a83e0cd3089
#2 b          j           11    21    31    41    21162fc0c528a6550b53c87ca0c2805e
#3 c          k           12    22    32    42    8d7539eacff61efbd567b6100227523b
#4 d          l           13    23    33    43    9739997605aa39620ce50e96f1ff4f70
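
The failure can be reproduced outside of mutate(). Per the dplyr documentation, c_across() uses vctrs::vec_c() rather than base c(), and vec_c() does not silently coerce character and integer to a common type. A minimal sketch, using two values from row 1 of df (the variable names are only for illustration):

```r
library(vctrs)

# Base c() silently coerces the integer to character
base_result <- c("i", 10L)
is.character(base_result)
#> [1] TRUE

# vec_c() finds no common type for <character> and <integer>, so it errors
vctrs_result <- try(vec_c("i", 10L), silent = TRUE)
inherits(vctrs_result, "try-error")
#> [1] TRUE
```

This is why converting every column with as.character before c_across() makes the rowwise version work.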

Another option is to unite() the columns into a single character column and then call digest() on that column

library(tidyr)
df %>%
    unite(new, everything(), remove = FALSE) %>%
    rowwise() %>%
    mutate(hash = digest(new)) %>%
    select(-new)
# A tibble: 4 x 7
# Rowwise: 
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <int> <int> <int> <int> <chr>                           
#1 a          i              10    20    30    40 a9e4cafdfbc88f17b7593dfd684eb2a1
#2 b          j              11    21    31    41 a67a5df8186972285bd7be59e6fdab38
#3 c          k              12    22    32    42 9c20bd87a50642631278b3e6d28ecf68
#4 d          l              13    23    33    43 3f4f373d1969dcf0c8f542023a258225
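
One caveat with this approach, not raised in the original answer: unite() joins values with "_" by default, so two different rows can produce the same string (and therefore the same hash) if the data itself contains underscores. A small sketch with made-up values:

```r
library(tidyr)
library(tibble)

# Two distinct rows whose values contain the default separator "_"
rows <- tibble(x = c("a_b", "a"), y = c("c", "b_c"))

united <- unite(rows, new, everything(), remove = FALSE)
united$new
#> [1] "a_b_c" "a_b_c"
```

Passing a sep that cannot occur in the data avoids this. It is not an issue for the example df above.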

A third option is pmap_chr() from purrr, which passes the elements of each row to c(). Because an atomic vector can hold only a single type, c() coerces the integers to character before digest() is applied

library(purrr)
df %>% 
     mutate(hash = pmap_chr(., ~ digest(c(...))))
# A tibble: 4 x 7
#  student_id student_id2 test1 test2 test3 test4 hash                            
#  <chr>      <chr>       <int> <int> <int> <int> <chr>                           
#1 a          i              10    20    30    40 f0fb4100907570ef9bda073b78dc44a6
#2 b          j              11    21    31    41 754b09e8d4d854aa5e40aa88d1edfc66
#3 c          k              12    22    32    42 5f3a699caff833e900fd956232cf61dd
#4 d          l              13    23    33    43 4d31c65284e5db36c37461126a9eb63c

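The advantage here is that the column types of df are left unchanged. Note that the three options return different hashes for the same row: digest() serializes the object it is given, and each option hands it a different R object.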
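
If the hash should also reflect the type of each value, which is what the question expects of digest(), one possibility is to hash a list of the row's values instead of an atomic vector, since a list keeps each element's own type. This is a sketch rather than part of the original answer; the column name hash_typed and the reduced two-column df are chosen here for illustration.

```r
library(dplyr)
library(purrr)
library(digest)

df <- tibble(
    student_id = letters[1:4],
    test1 = 10:13
)

# list(...) keeps each value's type, so nothing is coerced to character
dd <- df %>%
    mutate(hash_typed = pmap_chr(., function(...) digest(list(...))))

# The same value stored with a different type yields a different hash
digest(list(10L)) == digest(list("10"))
#> [1] FALSE
```

With this variant, a row containing the integer 10 and a row containing the string "10" are hashed differently, whereas the c()-based version treats them the same.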
