3
votes

How do I summarize a data.table with unreliable data across multiple columns?

Specifically, given

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
                 behavior=c(rep(FALSE,5),rep(TRUE,5)),
                 country=c(rep(1,4),rep(2,6)),
                 language=c(rep(6,6),rep(5,4)),
                 event=1:10, key=c("user",fields))
dt
#     user behavior country language event
#  1:    3    FALSE       1        6     1
#  2:    3    FALSE       1        6     2
#  3:    3    FALSE       1        6     3
#  4:    3    FALSE       1        6     4
#  5:    3    FALSE       2        6     5
#  6:    4     TRUE       2        5     7
#  7:    4     TRUE       2        5     8
#  8:    4     TRUE       2        5     9
#  9:    4     TRUE       2        5    10
# 10:    4     TRUE       2        6     6

I want to get

#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

(here the x.name is the most common x for the user and x.support is the share events where this top x was observed)

without having to go through both fields by hand like this:

users <- dt[, sum(behavior) > 0, by=user] # have behavior at least once
setnames(users, "V1", "behavior")
dt.out <- dt[, .N, by=list(user,country)
             ][, list(country[which.max(N)],max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("country",c(".name", ".support")))
users <- users[dt.out]
dt.out <- dt[, .N, by=list(user,language)
             ][, list(language[which.max(N)], max(N)/sum(N)), by=user]
setnames(dt.out, c("V1", "V2"),  paste0("language",c(".name", ".support")))
users <- users[dt.out]
users
#    user behavior country.name country.support language.name language.support
# 1:    3    FALSE            1             0.8             6              1.0
# 2:    4     TRUE            2             1.0             5              0.8

The actual number of fields is 5 and I want to avoid having to repeat the same code for each field separately, and have to edit this function if I ever modify fields. Please note that this is the substance of this question, the support computation was kindly explained to me elsewhere.

As in the referenced question, my data set has about 10^7 rows, so I really need a solution that scales; it would also be nice if I could avoid unnecessary copying like in users <- users[dt.out].

2
could you explain the caluclation of x.support - Ricardo Saporta
@RicardoSaporta: I added a few words on what support is, but you have to look at the referenced question for more details. - sds
You can substitute dt[, sum(behavior) > 0, by=user] with dt[,any(behavior),by=user]. You should also explain, why you want to apply to multiple columns programmatically. Either you have just a few columns, then it's probably not worth bothering. Or you have many columns, then it might be better to reshape your data.table to long format. - Roland
Sam, you are here asking for help and you are being obnoxious and rude. I wonder if that's the best way to elicit others to spend their time on your troubles. - Ricardo Saporta
@Arun: no, English is not my native tongue. I guess you are right, those words were rude. Sorry. - sds

2 Answers

5
votes

Does this solve your problem?

fields <- c("country","language")
dt <- data.table(user=c(rep(3, 5), rep(4, 5)),
           behavior=c(rep(FALSE,5),rep(TRUE,5)),
           country=c(rep(1,4),rep(2,6)),
           language=c(rep(6,6),rep(5,4)),
           event=1:10, key=c("user",fields))

CalculateSupport <- function(dt, name) {
  x <- dt[, .N, by = eval(paste0('user,', name))]
  setnames(x, name, 'name')
  x <- x[, list(name[which.max(N)], max(N)/sum(N)), by = user]
  setnames(x, c('V1', 'V2'), paste0(name, c(".name", ".support")))
  x
}

users <- dt[, sum(behavior) > 0, by=user] 
setnames(users, "V1", "behavior")

Reduce(function(x, name) x[CalculateSupport(dt, name)], fields, users)

results in

   user behavior country.name country.support language.name language.support
1:    3    FALSE            1             0.8             6              1.0
2:    4     TRUE            2             1.0             5              0.8

P.S. Please please take Ricardo's comment to your question seriously. SO is full of wonderful people who are willing to help but you have to treat them nicely and with respect.

1
votes

I can't do it in one expression, since I am not sure how to reuse a created field in a data.table expression. It's also probably not the most efficient way. Maybe this will make a good starting point, though.

#Find most common country and language for each user
summ.dt<-dt[,list(behavior.summ=sum(behavior)>0,
     country.name=dt[user==.BY[[1]],.N,by=country][N==max(N),country],
     language.name=dt[user==.BY[[1]],.N,by=language][N==max(N),language]),
by=user]

#Get support for each country and language for each user
summ.dt[,c("country.support","language.support"):=list(
     nrow(dt[user==.BY[[1]] & country==country.name])/nrow(dt[user==.BY[[1]]]),
     nrow(dt[user==.BY[[1]] & language==language.name])/nrow(dt[user==.BY[[1]]])
),by=user]

    user behavior.summ country.name language.name country.support language.support
1:    3         FALSE            1             6             0.8              1.0
2:    4          TRUE            2             5             1.0              0.8