R data.table: subsetting data.table/dataframe based on size of row value

Question

This is a basic question, but I'm stumped:

I have the following R data.table:

library(data.table)
DT <- fread('unique_point biased    data_points   team   groupID                                                                                                           
 up1          FALSE     3             A      xy28352                                                                                                                 
 up1          TRUE      4             A      xy28352                                                                                                                 
 up2          FALSE     1             A      xy28352                                                                                                                  
 up2          TRUE      0             X      xy28352                                                                                                                  
 up3          FALSE     12            Y      xy28352                                                                                                                 
 up3          TRUE      35            Z      xy28352')

which prints out as

> DT
   unique_point biased data_points team groupID
1:          up1  FALSE           3    A xy28352
2:          up1   TRUE           4    A xy28352
3:          up2  FALSE           1    A xy28352
4:          up2   TRUE           0    X xy28352
5:          up3  FALSE          12    Y xy28352
6:          up3   TRUE          35    Z xy28352

The values for the column team are letters A to Z, 26 possibilities. At the moment. If I count the row values with this code:

DT[, counts := .N, by=c("team")]

which gives

> DT
   unique_point biased data_points team groupID counts
1:          up1  FALSE           3    A xy28352      3
2:          up1   TRUE           4    A xy28352      3
3:          up2  FALSE           1    A xy28352      3
4:          up2   TRUE           0    X xy28352      1
5:          up3  FALSE          12    Y xy28352      1
6:          up3   TRUE          35    Z xy28352      1

I would like to create 26 new columns in DT which gives the size of each team, A, B, C, etc.

The resulting data.table would look like:

> DT
   unique_point biased data_points team groupID    A   B   C ... Z
1:          up1  FALSE           3    A xy28352    3   0   0 ... 1
2:          up1   TRUE           4    A xy28352    3   0   0 ... 1
3:          up2  FALSE           1    A xy28352    3   0   0 ... 1
4:          up2   TRUE           0    X xy28352    3   0   0 ... 1
5:          up3  FALSE          12    Y xy28352    3   0   0 ... 1
6:          up3   TRUE          35    Z xy28352    3   0   0 ... 1

I'm not sure how one does this with data.table syntax..

EDIT: I'm happy to do this with base R and dplyr as well.

How are going to look/use this output? Working with so many sparse column usually is a real mess. What's wrong with just sticking with the counts column you've created? — David Arenburg
@DavidArenburg I'm actually using mclapply through many data.tables to aggregate one data.table that will have each of these counts. At the moment, when I use counts from individual data.tables, there may not be a value for certain letters (e.g. in the above example, C, D, E, ...). So, I'm tried to create these sums for each individual data.table, and pass these values to the summary table after mclapply. Make sense? — ShanZhengYang

Luke C Luke C · Accepted Answer · 2017-07-18T22:40:01

What about plyr, is that ok?

library(data.table)
library(plyr)

DT <- fread('unique_point biased    data_points   team   groupID                                                                                                           
            up1          FALSE     3             A      xy28352                                                                                                                 
            up1          TRUE      4             A      xy28352                                                                                                                 
            up2          FALSE     1             A      xy28352                                                                                                                  
            up2          TRUE      0             X      xy28352                                                                                                                  
            up3          FALSE     12            Y      xy28352                                                                                                                 
            up3          TRUE      35            Z      xy28352')

ldply(LETTERS, function(x){
  n <- nrow(DT[team == as.character(x),])
  DT[, as.character(x) := n]
  return(DT[team == x,])
})

> DT
   unique_point biased data_points team groupID A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
1:          up1  FALSE           3    A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
2:          up1   TRUE           4    A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
3:          up2  FALSE           1    A xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
4:          up2   TRUE           0    X xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
5:          up3  FALSE          12    Y xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
6:          up3   TRUE          35    Z xy28352 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1

R data.table: subsetting data.table/dataframe based on size of row value

2 Answers