3
votes

I'm looking for a way to generate dummy variables that separate given categories into all possible grouping combinations. For example, if we have three categories (say A, B and C), there are five possible groupings:

Three groups: A / B / C
Two groups: A&B / C
Two groups: A&C / B
Two groups: A / B&C
One group: A&B&C

The dummy variable for each grouping would then be written to its own column of a data frame, so the final output I want looks like the following table:

sample_num  category    grouping1   grouping2   grouping3   grouping4   grouping5
                        A; B; C     A&B; C      A&C; B      A; B&C      A&B&C
-----------+---------+------------+-----------+-----------+-----------+----------
      1         A           1           1           1           1           1
      2         A           1           1           1           1           1
      3         A           1           1           1           1           1
      4         A           1           1           1           1           1
      5         B           2           1           2           2           1
      6         B           2           1           2           2           1
      7         B           2           1           2           2           1
      8         C           3           2           1           2           1
      9         C           3           2           1           2           1
     10         C           3           2           1           2           1
     11         C           3           2           1           2           1
     12         C           3           2           1           2           1
1
Your final output is not clear - what goes in what category? – thelatemail
I edited out all the portions asking for package suggestions, as this is one of the reasons questions may get closed. If you don't like this you can revert the changes. – Tyler Rinker
Thank you. I'm new to this site and I somehow cancelled your edit. Trying to bring it back. – Tsubasa Iwabuchi
@mnel - the numbers relate to the index of the category letter in each grouping - see my edit. – thelatemail
@thelatemail -- I see. Perhaps A&B should be A|B. – mnel

1 Answer

2
votes

The model.matrix function in the stats package (loaded by default) will construct "dummy variables" although not of the sort you describe. The first argument is an R "formula":

> dat <- read.table(text="sample_num  category 
+       1         A      
+       2         A      
+       3         A      
+       4         A      
+       5         B      
+       6         B      
+       7         B      
+       8         C      
+       9         C      
+      10         C      
+      11         C      
+      12         C", header=TRUE)
> model.matrix( ~category, data=dat)

   (Intercept) categoryB categoryC
1            1         0         0
2            1         0         0
3            1         0         0
4            1         0         0
5            1         1         0
6            1         1         0
7            1         1         0
8            1         0         1
9            1         0         1
10           1         0         1
11           1         0         1
12           1         0         1
attr(,"assign")
[1] 0 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.treatment"

I (strongly) suspect your four-column group of dummies must be linearly dependent and one of them would get rejected by the regression functions. Other contrast arguments are possible. You should study:

?model.matrix
?contrasts
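
As a rough check of the linear-dependence point, here is a small sketch. It assumes the three two-group codings from the question are rewritten as 0/1 indicators; the column names g2, g3 and g4 are made up for illustration and the response y is arbitrary.

```r
# Rebuild the example data (4 A's, 3 B's, 5 C's)
dat <- data.frame(category = rep(c("A", "B", "C"), times = c(4, 3, 5)))

# Two-group codings from the question as 0/1 indicators (hypothetical names)
dat$g2 <- as.integer(dat$category == "C")            # A&B / C
dat$g3 <- as.integer(dat$category == "B")            # A&C / B
dat$g4 <- as.integer(dat$category %in% c("B", "C"))  # A / B&C

X <- model.matrix(~ g2 + g3 + g4, data = dat)
n_cols <- ncol(X)      # intercept plus three indicators
x_rank <- qr(X)$rank   # smaller than n_cols, because g4 equals g2 + g3

# lm() reports the aliased column with an NA coefficient
dat$y <- seq_len(nrow(dat))   # arbitrary response, only to have something to fit
fit <- lm(y ~ g2 + g3 + g4, data = dat)
coef(fit)
```

Three categories can support at most three independent columns including the intercept, so any further grouping indicator is redundant.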

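This requests sum contrasts with no intercept. Note that once the intercept is dropped, the factor is coded with one indicator column per level, so the columns are plain 0/1 dummies even though the "contrasts" attribute records "contr.sum":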

> model.matrix(~category+0, data=dat, contrasts = list(category = "contr.sum"))
   categoryA categoryB categoryC
1          1         0         0
2          1         0         0
3          1         0         0
4          1         0         0
5          0         1         0
6          0         1         0
7          0         1         0
8          0         0         1
9          0         0         1
10         0         0         1
11         0         0         1
12         0         0         1
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$category
[1] "contr.sum"
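
For comparison, a sketch of the same call with the intercept kept, which is where the sum contrasts actually show up in the columns. The data frame is rebuilt here so the snippet runs on its own, and the argument is spelled out in full as contrasts.arg.

```r
# Same categories as above, stored explicitly as a factor
dat <- data.frame(
  category = factor(rep(c("A", "B", "C"), times = c(4, 3, 5)))
)

mm <- model.matrix(~ category, data = dat,
                   contrasts.arg = list(category = "contr.sum"))

# One row per distinct coding: the last level (C) is coded -1 in both
# contrast columns, which is what distinguishes contr.sum from contr.treatment
unique(mm)
```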

If you want to look at automatic calculation of varying levels of interaction, you will need three variables, rather than one variable with three levels:

> dat <- expand.grid(A=letters[1:3], B=letters[4:6], C=letters[7:9])
> str(model.matrix( ~ A*B*C))
Error in str(model.matrix(~A * B * C)) : 
  error in evaluating the argument 'object' in selecting a method for function 'str': Error in model.frame.default(object, data, xlev = xlev) : 
  invalid type (closure) for variable 'C'
> str(model.matrix( ~ A*B*C, data=dat))
 num [1:27, 1:27] 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:27] "1" "2" "3" "4" ...
  ..$ : chr [1:27] "(Intercept)" "Ab" "Ac" "Be" ...
 - attr(*, "assign")= int [1:27] 0 1 1 2 2 3 3 4 4 4 ...
 - attr(*, "contrasts")=List of 3
  ..$ A: chr "contr.treatment"
  ..$ B: chr "contr.treatment"
  ..$ C: chr "contr.treatment"

model.matrix( ~ A*B*C, data=dat)

(output omitted)
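
None of the above produces the grouping columns from the question directly. One way to get them in base R is to enumerate every way of splitting the categories into groups and then look up each row's group label. The sketch below is only an illustration: all_groupings() is a hypothetical helper written for this answer, not a function from any package, and its column order is the reverse of the table in the question (everything in one group first, every category on its own last).

```r
# Enumerate all groupings (set partitions) of n items.
# Each partition is a vector of labels where item 1 is always in group 1 and
# every later item either joins an existing group or opens the next new one.
all_groupings <- function(n) {
  parts <- list(1L)
  for (i in seq_len(n)[-1]) {
    parts <- unlist(lapply(parts, function(p) {
      lapply(seq_len(max(p) + 1L), function(k) c(p, k))
    }), recursive = FALSE)
  }
  do.call(cbind, parts)   # rows = items, columns = groupings
}

dat <- data.frame(sample_num = 1:12,
                  category = rep(c("A", "B", "C"), times = c(4, 3, 5)))

lev <- sort(unique(as.character(dat$category)))
G <- all_groupings(length(lev))

# Look up each sample's group label under every grouping
codes <- G[match(dat$category, lev), , drop = FALSE]
colnames(codes) <- paste0("grouping", seq_len(ncol(codes)))
out <- cbind(dat, codes)
out
```

The number of columns grows quickly with the number of categories (5 for three categories, 15 for four), so this is only practical for a small number of levels.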