Subset data frame based on first letters of column name

Question

I have a large dataframe with multiple columns representing different variables that were measured for different individuals. The name of the columns always start with a number (e.g. 1:18). I would like to subset the df and create separete dfs for each individual. Here it is an example:

x <- as.data.frame(matrix(nrow=10,ncol=18))
colnames(x) <- paste(1:18, 'col', sep="")

The column names of my real df is a composition of the Individual ID, the variable name, and the number of the measure (I took 3 measures of each variable). So for instance I have the measure b (body) for individual 1, then in the df I would have 3 columns named: 1b1, 1b2, 1b3. In the end I have 10 different regions (body, head, tail, tail base, dorsum, flank, venter, throat, forearm, leg). So for each individual I have 30 columns (10 regions x 3 measures per region). So I have multiple variables starting with the different numbers and I would like to subset then based on their unique numbers. I tried using grep:

partialName <- 1
df2<- x[,grep(partialName, colnames(x))]
colnames(x)
[1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col" 
"11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col"

My problem here as you can see it doesn't separate the individuals because 1 and 10 are in the subset. In other words this selects everybody that starts with 1. Ultimately what I would like to do is to loop over all my individuals (1:18), creating new dfs for each individual.

Welcome to SO. Could you provide the colnames and clarify what are you expecting? — llrs
Thanks for the reply Llopis. Ok. In the example above my colnames are: [1] "1col" "2col" "3col" "4col" "5col" "6col" "7col" "8col" "9col" "10col" "11col" "12col" "13col" "14col" "15col" "16col" "17col" "18col" I would like to return diferent dfs subsetting the columns in which their names start with the same number. In the example above I would have 18 new dfs, each with 1 column, 1col, 2col, 3cols and so forth. Is that clear? — costagc
@user3293163 edit these into the question, not post as a comment. — Paul Hiemstra
If your column names were in the format 1_var1, 2_var1, 12_var10 then you could use melt to get the data in long format and then use colsplit to split the column names into 2 variables for the individual and the measure. Subsequently you can cast the data and/or split it as required. — Matt Weller
@user3293163 what do you real colnames look like, and do you have any control over them or are simply a given. — Paul Hiemstra

Paul Hiemstra Paul Hiemstra · Accepted Answer · 2014-02-10T15:15:33

I think keeping the data in one data.frame is the best option here. Either that, or put it into a list of data.frame's. This makes it easy to extract summary statistics per individual much easier.

First create some example data:

df = as.data.frame(matrix(runif(50 * 100), 100, 50), stringsAsFactors = FALSE)
names_variables = c('spam', 'ham', 'shrub')
individuals = 1:100
column_names = paste(sample(individuals, 50), 
                     sample(names_variables, 50, TRUE), 
                     sep = '')
colnames(df) = column_names

What I would do first is use melt to cast the data from wide format to long format. This essentially stacks all the columns in one big vector, and adds an extra column telling which column it came from:

library(reshape2)
df_melt = melt(df)
head(df_melt)
  variable      value
1    85ham 0.83619111
2    85ham 0.08503596
3    85ham 0.54599402
4    85ham 0.42579376
5    85ham 0.68702319
6    85ham 0.88642715

Then we need to separate the ID number from the variable. The assumption here is that the numeric part of the variable is the individual ID, and the text is the variable name:

library(dplyr)
df_melt = mutate(df_melt, individual_ID = gsub('[A-Za-z]', '', variable),
                          var_name = gsub('[0-9]', '', variable))

essentially removing the part of the string not needed. Now we can do nice things like:

mean_per_indivdual_per_var = summarise(group_by(df_melt, individual_ID, var_name), 
                                       mean(value))
head(mean_per_indivdual_per_var)
  individual_ID var_name mean(value)
1            63     spam   0.4840511
2            46      ham   0.4979884
3            20    shrub   0.5094550
4            90      ham   0.5550148
5            30    shrub   0.4233039
6            21      ham   0.4764298

Subset data frame based on first letters of column name

2 Answers