Computationally inexpensive way to find the min and max of a variable with respect to a factor in R in a large data frame?

Question

I have a very large data frame where each value is associated with a factor, like this:

value      user
12         USER1
4          USER5
6          USER3
50         USER1
2          USER2
1          USER1
8          USER5
9          USER3
55         USER1
15         USER2

I want to find the max and min of value for each user. I tried a for loop that goes through the user list, creates a temporary variable for each user, and finds the max and min there. However, the data frame is quite big (100 MB) and this takes a really long time (30 minutes). Is there a smarter way to do this? Thanks.

Colonel Beauvel · Accepted Answer · 2015-01-29T17:00:19

If df is your original data.frame, the data.table package is recommended for "big" data:

library(data.table)

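dt = data.table(df)   # convert the data.frame to a data.table
setkey(dt, user)      # key the table by user

# compute min and max of value within each user group
dt[,list(min(value), max(value)),by=user]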
    user V1 V2
1: USER1  1 55
2: USER5  4  8
3: USER3  6  9
4: USER2  2 15
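
For anyone who wants to try this without the original data, here is a minimal self-contained sketch that rebuilds the sample from the question and names the result columns instead of relying on the default V1 and V2. The names res, min_value and max_value are my own illustrative choices, not part of the answer above, and as.data.table() is used here as an equivalent way to do the conversion.

```r
library(data.table)

# Sample data copied from the question
df <- data.frame(
  value = c(12, 4, 6, 50, 2, 1, 8, 9, 55, 15),
  user  = c("USER1", "USER5", "USER3", "USER1", "USER2",
            "USER1", "USER5", "USER3", "USER1", "USER2")
)

# Convert to a data.table (same idea as data.table(df) above)
dt <- as.data.table(df)

# One grouped pass over the data: min and max of value per user,
# with explicit (hypothetical) column names for the result
res <- dt[, list(min_value = min(value), max_value = max(value)), by = user]
print(res)
```

The grouping is done inside data.table in a single call, which avoids subsetting the data frame once per user as the for loop in the question does.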

Edit: this is also a good use case for each() from plyr:

> library(plyr)
> dt[,as.list(each(min,max)(value)),by=user]
    user min max
1: USER1   1  55
2: USER5   4   8
3: USER3   6   9
4: USER2   2  15
