For dummy dataset
require(data.table)
require(reshape2)
teamid <- c(1,2,3)
member <- c("a,b","","c,g,h")
leader <- c("c", "d,e", "")
dt <- data.table(teamid, member, leader)
Now the dataset looks like this:
teamid member leader 1: 1 a,b c 2: 2 d,e 3: 3 c,g,h
3 Columns. For each team, they have team members, and team leaders in different column. Teams may have only members without leaders, and vice versa.
The following is my ALMOST desired output:
teamid value leader 1: 1 a FALSE 2: 1 b FALSE 3: 1 c TRUE 4: 1 c TRUE 5: 2 d TRUE 6: 2 e TRUE 7: 3 c FALSE 8: 3 g FALSE 9: 3 h FALSE
I want to have the two columns merged into one, and add a tag if one is a team leader.
I have an ugly solution for this,
dt1 <- dt[, strsplit(member, ","), by = teamid]
dt2 <- dt[, strsplit(leader, ","), by = teamid]
setkey(dt1,teamid)
setkey(dt2,teamid)
dt3 <- merge(dt1,dt2, all = TRUE)
dt4 <- melt(dt3, id = 1, measure = c("V1.x", "V1.y"))
dt5 <- dt4[value!="NA_real"]
dt6 <- dt5[, leader := (variable == "V1.y")][, variable := NULL]
setkey(dt6, teamid)
setnames(dt6,value,member)
Issues:
This solution is not efficency I think, first merge and then melt. So any ideas about other ways to do this?
There're duplicated rows, in row 3 and row 4.
When I tried to change column name, an error came up
setnames(dt6,value,member)
Error in setnames(dt6, value, member) : object 'value' not found
Maybe the most important thing,
When I tried to test on my real dataset, which have more 1million rows, 3 columns the following error occured
merge(df1,df2, all = TRUE) Error in vecseq(f__, len__, if (allow.cartesian) NULL else as.integer(max(nrow(x), : Join results in 238797 rows; more than 142095 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including
j
and droppingby
(by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
Any suggestion? Thanks a lot!