Remove columns using column name based on levels of factors

Question

I have a data frame with a factor column and one value column per factor level, plus value columns for additional levels that no longer occur in the data frame. Example:

x <- data.frame(f = toupper(sample(letters[1:3], 5, replace = TRUE)),
                x.A = 1:5,
                x.B = 1:5,
                x.C = 1:5,
                x.D = 1:5,
                x.E = 1:5)

Resulting in:

  f x.A x.B x.C x.D x.E
1 B   1   1   1   1   1
2 B   2   2   2   2   2
3 A   3   3   3   3   3
4 C   4   4   4   4   4
5 A   5   5   5   5   5

Now I want to remove all columns that do not represent a current level in column f, resulting in a data frame:

  f x.A x.B x.C
1 B   1   1   1
2 B   2   2   2
3 A   3   3   3
4 C   4   4   4
5 A   5   5   5

Naming convention is consistent among levels and column names, and names always take the form somevariable.FACTORLEVEL. I would type all the names in a list to choose from, but it gets long and unwieldy. I tried using grep as follows:

subX <- x[x$f == 'B', grep('B', names(x))]

But I don't quite get what I want, and I don't know how to extend that over all levels even if it did work. I also looked at previous questions here and here, but they don't go as far as I need. Any help would be appreciated. Thanks.

akrun · Accepted Answer · 2015-11-23T18:22:15

We use sub to strip the prefix (everything up to and including the dot, e.g. x.) from the column names of 'x', check whether each remaining suffix is %in% the 'f' column to get a logical vector, and use that vector to subset the columns of 'x'. The first column name is left out of the check (as it is 'f'), so we concatenate a TRUE in front to keep that column in the subset as well.

x[c(TRUE, sub('.*\\.', '', names(x)[-1]) %in% x$f)]
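
To make the one-liner easier to follow, here is the same logic split into steps. This is only a sketch: the 'f' column is fixed to the values shown in the question's output so the result is reproducible, and the intermediate names suffix, keep and res are made up for illustration.

```r
# Deterministic copy of the example data (f fixed to the values shown in the question)
x <- data.frame(f = c("B", "B", "A", "C", "A"),
                x.A = 1:5, x.B = 1:5, x.C = 1:5, x.D = 1:5, x.E = 1:5)

# Strip everything up to and including the last dot: "x.A" becomes "A"
suffix <- sub(".*\\.", "", names(x)[-1])

# TRUE where the suffix occurs among the values of f
keep <- suffix %in% x$f

# Prepend TRUE so that the 'f' column itself is retained
res <- x[c(TRUE, keep)]
names(res)
# [1] "f"   "x.A" "x.B" "x.C"
```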

Or we can use grepl with a pattern built by pasting together the values of the 'f' column, which returns a logical index as before.

x[c(TRUE,grepl(paste(x$f, collapse='|'), names(x)[-1]))]
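
One caveat with the grepl pattern above: it matches a level anywhere in the column name, so it relies on the prefix never containing a level. If that cannot be guaranteed, anchoring the pattern to the end of the name may be safer. A rough sketch, using a hypothetical data frame y with a made-up column B.D, and assuming the levels contain no regex metacharacters:

```r
# Hypothetical data: the prefix of "B.D" contains "B", but its level "D" is not in f
y <- data.frame(f = c("B", "B", "A", "C", "A"),
                x.A = 1:5, x.B = 1:5, B.D = 1:5)

# Unanchored pattern "B|A|C" also matches the "B" in the prefix of "B.D"
loose <- grepl(paste(unique(y$f), collapse = "|"), names(y)[-1])

# Anchored pattern "\\.(B|A|C)$" only matches a level right after the last dot
pat <- paste0("\\.(", paste(unique(y$f), collapse = "|"), ")$")
strict <- grepl(pat, names(y)[-1])

names(y[c(TRUE, loose)])
# [1] "f"   "x.A" "x.B" "B.D"
names(y[c(TRUE, strict)])
# [1] "f"   "x.A" "x.B"
```

The %in% approach does not have this problem, since it compares the whole suffix rather than searching for a substring.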
