0
votes

I am trying to make an interactive Sankey with the package. I have a dataset with eight columns.

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

I am using the code below in this post starting with "This question comes up a lot...": https://stackoverflow.com/a/52237151/4389763

This is the code I have:

df <- df %>% select(year1,year2,year3,year4,year5,year6,year7) 

links <-
df %>%
mutate(row = row_number()) %>%
gather('column', 'source', -row) %>%
mutate(column = match(column, names(df))) %>%
group_by(row) %>%
arrange(column) %>%
mutate(target = lead(source)) %>%
ungroup() %>%
filter(!is.na(target))

links <-
links %>%
mutate(source = paste0(source, '_', column)) %>%
mutate(target = paste0(target, '_', column + 1)) %>%
select(source, target)

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
          Target = 'target', Value = 'value', NodeID = 'name')

But I don't know how to add the value of the flow. For example from DBC to DBBC occurs five times in year1 to year2. And DBBC to DBBC occurs three times from year2 to year3. With the code above I see every occurance as 1 and I would like to see the total value of a flow.

Like this example of a Sankey. Where you can see the total of for example group_A to group_C and not every occurance.

And is it possible to see the percentages in the mouse over? For example Year1 = DBC to Year2 = DBBC value is 5 out of 5 and percentage is 100%.

Can someone help me? Thank you.

2
please provide a minimal reproducible example - CJ Yetman
also, you will have to explain what the weight means... as it is, you have one value for weight per row, but each row has multiple links... if you want that to give the "value" for each link, then you're missing a bunch of data - CJ Yetman
Thanks for your reaction. I have changed the question and added an example. I hope you can help me. @CJYetman - SuGer
Please also show the code that you have already tried - CJ Yetman
@CJYetman, Ok I added the code. - SuGer

2 Answers

0
votes

I have changed the code:

Instead of:

links$value <- 1

The new code:

links <- links %>% group_by(source, target) %>% tally()
names(links)[3] <- "value"
0
votes

The first part of your question--how to get a dataset of links (source and target columns) from a dataset that has multiple links/edges defined on each row across several columns--is sufficiently answered by the answer that you linked to (with the minor addition that you start with an extra column, clientcode, which does not contain link information, so it needs to be removed first).

df <- read.csv(header = TRUE, as.is = TRUE, text = '
clientcode,year1,year2,year3,year4,year5,year6,year7
1,DBC,DBBC,DBBC,DBC,DBC,"Not in care","Not in care"
2,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
3,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
4,DBC,DBBC,"Not in care","Not in care","Not in care","Not in care","Not in care"
5,DBC,DBBC,DBBC,"Not in care","Not in care","Not in care","Not in care"
')

library(dplyr)
library(tidyr)

links <-
  df %>%
  select(-clientcode) %>% 
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target)) %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

links

# # A tibble: 30 x 2
#    source target       
#    <chr>  <chr>        
#  1 DBC_2  DBBC_3       
#  2 DBC_2  DBBC_3       
#  3 DBC_2  DBBC_3       
#  4 DBC_2  DBBC_3       
#  5 DBC_2  DBBC_3       
#  6 DBBC_3 DBBC_4       
#  7 DBBC_3 DBBC_4       
#  8 DBBC_3 Not in care_4
#  9 DBBC_3 Not in care_4
# 10 DBBC_3 DBBC_4       
# # ... with 20 more rows

The second part of your question is essentially, with a dataset of individual links, how can I aggregate like links into one link with a value column indicating how many individual links were aggregated into that one link. That can be achieved by grouping the source and target columns and summarizing with the count of rows.

links %>% 
  group_by(source, target) %>% 
  summarise(value = n())

# # A tibble: 11 x 3
# # Groups:   source [?]
#    source        target        value
#    <chr>         <chr>         <int>
#  1 DBBC_3        DBBC_4            3
#  2 DBBC_3        Not in care_4     2
#  3 DBBC_4        DBC_5             1
#  4 DBBC_4        Not in care_5     2
#  5 DBC_2         DBBC_3            5
#  6 DBC_5         DBC_6             1
#  7 DBC_6         Not in care_7     1
#  8 Not in care_4 Not in care_5     2
#  9 Not in care_5 Not in care_6     4
# 10 Not in care_6 Not in care_7     4
# 11 Not in care_7 Not in care_8     5

Since you want to display the percentage, not the count, you can modify that slightly to calculate the percentage of all links in each year, and then use the unit = "%" parameter of sankeyNetwork so that it displays properly.

links <- 
  links %>% 
  group_by(source, target) %>% 
  summarise(value = n() / nrow(df) * 100)

links

# # A tibble: 11 x 3
# # Groups:   source [?]
#    source        target        value
#    <chr>         <chr>         <dbl>
#  1 DBBC_3        DBBC_4           60
#  2 DBBC_3        Not in care_4    40
#  3 DBBC_4        DBC_5            20
#  4 DBBC_4        Not in care_5    40
#  5 DBC_2         DBBC_3          100
#  6 DBC_5         DBC_6            20
#  7 DBC_6         Not in care_7    20
#  8 Not in care_4 Not in care_5    40
#  9 Not in care_5 Not in care_6    80
# 10 Not in care_6 Not in care_7    80
# 11 Not in care_7 Not in care_8   100

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name', 
              units = "%")

enter image description here