1
votes

I have a dataset with some duplicate entries. I want to reduce it to the unique combinations of values, with a dup_num column giving the number of duplicate entries and a dup_rows column listing which rows contain the duplicated data.

I implemented a solution based on "Finding duplicate observations of selected variables in a tibble", but it throws a string of warnings when the list column of row numbers is coerced to a character vector. That is not a problem right now, but I want to show this data with DT and Shiny, where the warnings are a problem.

library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c(
               "Moe", "Larry", "Curly", "Shemp", "extra"
             ), 6))

chr_dups <- as_mapper( ~ str_c(.x) %>%
                         str_remove_all("[c\\(\\)]"))

df %>%
  nest(episode, .key = "dups") %>%
  mutate(dup_num = map_dbl(dups, nrow),
         dup_rows = map_chr(dups, chr_dups))
#> Warning in stri_c(..., sep = sep, collapse = collapse, ignore_null = TRUE):
#> argument is not an atomic vector; coercing

#> (the same warning is emitted 15 times in total, once per nested group)
#> # A tibble: 15 x 5
#>    day   name  dups             dup_num dup_rows
#>    <chr> <chr> <list>             <dbl> <chr>   
#>  1 Mon   Moe   <tibble [2 x 1]>       2 1, 16   
#>  2 Wed   Larry <tibble [2 x 1]>       2 2, 17   
#>  3 Fri   Curly <tibble [2 x 1]>       2 3, 18   
#>  4 Mon   Shemp <tibble [2 x 1]>       2 4, 19   
#>  5 Wed   extra <tibble [2 x 1]>       2 5, 20   
#>  6 Fri   Moe   <tibble [2 x 1]>       2 6, 21   
#>  7 Mon   Larry <tibble [2 x 1]>       2 7, 22   
#>  8 Wed   Curly <tibble [2 x 1]>       2 8, 23   
#>  9 Fri   Shemp <tibble [2 x 1]>       2 9, 24   
#> 10 Mon   extra <tibble [2 x 1]>       2 10, 25  
#> 11 Wed   Moe   <tibble [2 x 1]>       2 11, 26  
#> 12 Fri   Larry <tibble [2 x 1]>       2 12, 27  
#> 13 Mon   Curly <tibble [2 x 1]>       2 13, 28  
#> 14 Wed   Shemp <tibble [2 x 1]>       2 14, 29  
#> 15 Fri   extra <tibble [2 x 1]>       2 15, 30

Created on 2019-09-19 by the reprex package (v0.3.0)

I am pretty sure that the problem is in as_mapper().

The reprex above uses representative toy data. The tibble describes some episodes from the Three Stooges, the day each episode ran, and the character who was the protagonist of the episode.

Thanks!

3 Answers

3
votes

The warning appears because the list elements are not atomic: the dups column is a list of tibbles, which we can see if we pull the column:

df %>%
  nest(dups = episode)  %>% 
  pull(dups)
#<list_of<tbl_df<episode:integer>>[15]>
#[[1]]
# A tibble: 2 x 1
#  episode
#    <int>
#1       1
#2      16

#[[2]]
# A tibble: 2 x 1
#  episode
#    <int>
#1       2
#2      17
# ...
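
As a quick check of the point above, the sketch below compares one nested element with the integer column inside it. It assumes tidyr >= 1.0 for the nest(dups = episode) syntax; `dups` and `first_dup` are just illustrative names.

```r
library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

# Pull out the nested list column; each element is a 2 x 1 tibble
dups <- df %>%
  nest(dups = episode) %>%
  pull(dups)

first_dup <- dups[[1]]

is.atomic(first_dup)          # a tibble is a list, not an atomic vector
is.atomic(first_dup$episode)  # the integer column inside it is atomic
```

Passing the whole tibble to str_c() is what triggers the coercion warning; passing the episode vector does not.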

So dups is a list of tibbles. We can either extract the episode column from each element with pull, or flatten each element and then apply the function:

library(purrr)
df %>%
   nest(dups = episode) %>%
   mutate(dup_num = map_dbl(dups, nrow), 
         dup_rows = map(dups, ~ flatten_int(.x) %>% 
                                     chr_dups))
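
Note that the code above leaves dup_rows as a list column, because map() is used and each element stays a character vector. If a single string per group is wanted (as in the question's output), one possible variant is to pull the episode vector out of each nested tibble and collapse it. This is a minimal sketch, assuming tidyr >= 1.0 and the df from the question; `res` is an illustrative name.

```r
library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

res <- df %>%
  nest(dups = episode) %>%
  mutate(
    # number of rows in each nested tibble
    dup_num = map_int(dups, nrow),
    # pull the atomic episode vector, then collapse it into one string
    dup_rows = map_chr(dups, ~ str_c(pull(.x, episode), collapse = ", "))
  )
```

Because str_c() only ever sees an integer vector here, no coercion warning should be raised, and the regex clean-up step is no longer needed.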

NOTE: It is not clear why 'chr_dups' is applied to the 'episode' column, which is numeric. The transformations in it also do not seem to make sense for numeric input.


If we just need to paste together the elements of 'episode' grouped by the other columns, a one-line base R approach is:

aggregate(episode ~ day + name, df, toString)
#   day  name episode
#1  Fri Curly   3, 18
#2  Mon Curly  13, 28
#3  Wed Curly   8, 23
#4  Fri extra  15, 30
#5  Mon extra  10, 25
#6  Wed extra   5, 20
#7  Fri Larry  12, 27
#8  Mon Larry   7, 22
#9  Wed Larry   2, 17
#10 Fri   Moe   6, 21
#11 Mon   Moe   1, 16
#12 Wed   Moe  11, 26
#13 Fri Shemp   9, 24
#14 Mon Shemp   4, 19
#15 Wed Shemp  14, 29
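
The one-liner above gives the row list but not the count. If the dup_num column from the question is also needed, a possible base R sketch is to aggregate twice and merge the results. The names `rows`, `counts` and `res` are illustrative, and a plain data.frame is used here so that no packages are required.

```r
df <- data.frame(episode = 1:30,
                 day = rep(c("Mon", "Wed", "Fri"), 10),
                 name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

# One aggregate for the pasted row numbers, one for the group sizes
rows   <- aggregate(episode ~ day + name, df, toString)
counts <- aggregate(episode ~ day + name, df, length)

names(rows)[3]   <- "dup_rows"
names(counts)[3] <- "dup_num"

# Join the two summaries on the grouping columns
res <- merge(counts, rows, by = c("day", "name"))
```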

2
votes

I think the source of the warning has already been addressed. I'll add that you can do this without mapping, using just vectorised functions.

library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c(
               "Moe", "Larry", "Curly", "Shemp", "extra"
             ), 6))

df %>%
  group_by(day, name) %>%
  summarise(
    dup_num = n(),
    dup_rows = str_c(episode, collapse = ", ")
  )
#> # A tibble: 15 x 4
#> # Groups:   day [3]
#>    day   name  dup_num dup_rows
#>    <chr> <chr>   <int> <chr>   
#>  1 Fri   Curly       2 3, 18   
#>  2 Fri   extra       2 15, 30  
#>  3 Fri   Larry       2 12, 27  
#>  4 Fri   Moe         2 6, 21   
#>  5 Fri   Shemp       2 9, 24   
#>  6 Mon   Curly       2 13, 28  
#>  7 Mon   extra       2 10, 25  
#>  8 Mon   Larry       2 7, 22   
#>  9 Mon   Moe         2 1, 16   
#> 10 Mon   Shemp       2 4, 19   
#> 11 Wed   Curly       2 8, 23   
#> 12 Wed   extra       2 5, 20   
#> 13 Wed   Larry       2 2, 17   
#> 14 Wed   Moe         2 11, 26  
#> 15 Wed   Shemp       2 14, 29

Created on 2019-09-19 by the reprex package (v0.3.0)
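
Since the question mentions Shiny, it may be worth noting that the result above is still grouped by day (see the `# Groups: day [3]` line in the output), and newer dplyr versions may print a message about this when summarising. A minimal sketch that returns an ungrouped tibble, assuming dplyr >= 1.0.0 (where the `.groups` argument is available); `res` is an illustrative name:

```r
library(tidyverse)

df <- tibble(episode = 1:30,
             day = rep(c("Mon", "Wed", "Fri"), 10),
             name = rep(c("Moe", "Larry", "Curly", "Shemp", "extra"), 6))

res <- df %>%
  group_by(day, name) %>%
  summarise(
    dup_num = n(),
    dup_rows = str_c(episode, collapse = ", "),
    .groups = "drop"  # drop all grouping in the result
  )
```

On older dplyr versions, piping the result into ungroup() should have a similar effect.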

1
votes

Just adding to the other answers: you don't have to use purrr to achieve what you want. Base R's sapply will do for the mapping step.

df <- df %>%
  nest(episode, .key = "dups") %>%
  mutate(dup_num = sapply(dups, nrow),
         dup_rows = sapply(dups, function(x) paste0(x$episode, collapse = ",")))