1
votes

For dummy dataset

require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)

Now the dataset looks like this:

   teamid member leader
1:      1    a,b      c
2:      2           d,e
3:      3  c,g,h  

3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.

The following is my ALMOST desired output:

   teamid value leader
1:      1     a  FALSE
2:      1     b  FALSE
3:      1     c   TRUE
4:      1     c   TRUE
5:      2     d   TRUE
6:      2     e   TRUE
7:      3     c  FALSE
8:      3     g  FALSE
9:      3     h  FALSE

I want to have the two columns merged into one, and add a tag if one is a team leader.

I have an ugly solution for this,

dt1 <- dt[, strsplit(member, ","), by = teamid] dt2 <- dt[, strsplit(leader, ","), by = teamid] setkey(dt1,teamid) setkey(dt2,teamid) dt3 <- merge(dt1,dt2, all = TRUE) dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y")) dt5 <- dt4[value!="NA_real"] dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL] setkey(dt6, teamid) setnames(dt6,value,member)

Issues:

  1. This solution is not efficency I think, first merge and then melt. So any ideas about other ways to do this?

  2. There're duplicated rows, in row 3 and row 4.

  3. When I tried to change column name, an error came up

setnames(dt6,value,member)

Error in setnames(dt6, value, member) : object 'value' not found

Maybe the most important thing,

When I tried to test on my real dataset, which have more 1million rows, 3 columns the following error occured

merge(df1,df2, all = TRUE) Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including j and dropping by (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

Any suggestion? Thanks a lot!

3

3 Answers

2
votes

Melt first.

result <- melt(dt,id="teamid", variable.name="status", value.name="member")
result <- result[nchar(member)>0,strsplit(member,","),by=list(teamid,status)]
setnames(result,"V1","member")
setkey(result,teamid,status)
result
#    teamid status member
# 1:      1 member      a
# 2:      1 member      b
# 3:      1 leader      c
# 4:      2 leader      d
# 5:      2 leader      e
# 6:      3 member      c
# 7:      3 member      g
# 8:      3 member      h

If you want to get rid of the status column and add a "tag" to the member column, you can do it this way:

result[status=="leader",member:=paste0(member,"*")]
result[,status:=NULL]
result
#    teamid member
# 1:      1      a
# 2:      1      b
# 3:      1     c*
# 4:      2     d*
# 5:      2     e*
# 6:      3      c
# 7:      3      g
# 8:      3      h
0
votes

A slightly simpler approach may be

crew <- dt[, .(strsplit(member, ","))]
crew <- unlist(crew)
leads <- dt[, .(strsplit(leader, ","))]
leads <- unlist(leads)

dt_long <- data.table(people=c(crew, leads), 
    status = rep(c("crew", "leader"), c(length(crew), length(leader))))

It gives me

  people status
1:      a   crew
2:      b   crew
3:      c   crew
4:      g   crew
5:      h   crew
6:      c leader
7:      d leader
8:      e leader
0
votes

You can try a tidyverse solution now

dt %>% 
  separate_rows(member) %>% 
  separate_rows(leader) %>% 
  gather(status, member, -teamid) %>% 
  distinct() %>% 
  filter(member != "") %>% 
  mutate(member=ifelse(status == "leader", paste0(member, "*"), member)) %>% 
  select(-status)
  teamid member
1      1      a
2      1      b
3      3      c
4      3      g
5      3      h
6      1     c*
7      2     d*
8      2     e*