42
votes

The problem:

I often need to select a set of variables from a data.frame in R. My research is in the social and behavioural sciences, and it is quite common to have a data.frame with several hundreds of variables (e.g., there'll be item level information for a range of survey questions, demographic items, performance measures, etc., etc.).

As part of analyses, I'll often want to select a subset of variables. For example, I might want to get:

  • descriptive statistics for a set of variables
  • correlation matrix on a set of variables
  • factor analysis on a set of variables
  • predictors in a linear model

Now, I know that there are many ways to write the code to select a subset of variables. Quick-R has a nice overview of common ways of extracting variable subsets from a data.frame.

e.g.,

myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]

However, I'm interested in the efficiency of this process, particularly where you might need to extract 20 or so variables from a data.frame. The naming convention of variables is often not intuitive, especially where you've inherited a dataset from someone else, so you might be left wondering, was the variable Gender, gender, sex, GENDER, gender1, etc. Multiply this by 20 variables that need to be extracted, and the task of memorising variable names becomes more complicated than it needs to be.

Concrete example

To make the following discussion concrete, I'll use the bfi data.frame in the psych package.

library(psych)
data(bfi)
df <- bfi
head(df, 1)
      A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4
61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4
      O5 gender education age
61617  3      1        NA  16
  • How can I efficiently select an arbitrary set of variables, which for concreteness, I'll choose A1, A2, A3, A5, C2, C3, C5, E2, E3, gender, education, age?

My current strategy

I currently have a range of strategies that I use. Of course, sometimes I can exploit things like the numeric position of the variables or the naming convention and use either grep to select names or paste to construct them. But sometimes I need a more general solution. I've used the following over time:

1. names(df)

In the early days, I used to call names(df), copy the quoted variable names and then edit until I have what I want.

2. Use a database

Sometimes I'll have a separate data.frame that stores each variable as a row, and has columns for variable names, variable labels, and it has a column which indicates whether the variable should be retained for a particular analysis. I can then filter on that include variable and extract a vector of variable names. I find this particularly useful when I'm developing a psychological test and for various iterations I want to include or exclude certain items.
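
A minimal sketch of this approach, assuming a small hypothetical metadata data.frame (the names meta, id, label and include are illustrative, not taken from a real project):

```r
# Hypothetical metadata: one row per variable in the dataset
meta <- data.frame(
  id      = c("A1", "A2", "A3", "gender", "age"),
  label   = c("Agreeableness 1", "Agreeableness 2", "Agreeableness 3",
              "Gender", "Age in years"),
  include = c(TRUE, FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep only the variables flagged for this analysis
myvars <- meta$id[meta$include]
myvars
```

The resulting character vector can then be used as usual, e.g. newdata <- df[myvars].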

3. dput(names(df))

As Hadley Wickham once pointed out to me, dput is a good option; e.g., dput(names(df)) is better than names(df) in that it outputs a character vector that is already in the c("var1", "var2", ...) format:

dput(names(df))
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", 
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1", 
"O2", "O3", "O4", "O5", "gender", "education", "age")

This can then be copied into the script and edited.
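
For the concrete example, the edited result might look like the sketch below. It builds a one-row stand-in data.frame with the same column names as bfi so that it runs without the psych package; that stand-in is an assumption for illustration, and with the real data you would simply use df <- bfi.

```r
# Stand-in data.frame with the bfi column names (assumption: avoids needing psych)
all_names <- c(paste0(rep(c("A", "C", "E", "N", "O"), each = 5), 1:5),
               "gender", "education", "age")
df <- as.data.frame(matrix(0, nrow = 1, ncol = length(all_names),
                           dimnames = list(NULL, all_names)))

# Output of dput(names(df)), pasted into the script and trimmed to the wanted set
myvars <- c("A1", "A2", "A3", "A5", "C2", "C3", "C5", "E2", "E3",
            "gender", "education", "age")
newdata <- df[myvars]
names(newdata)
```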

But can it be more efficient?

I guess dput is a pretty good variable selection strategy. The efficiency of the process largely depends on how proficient you are in copying the text into your script and then editing the list of names down to those desired.

However, I still remember the efficiency of GUI based systems of variable selection. For example, in SPSS when you interact with a dialogue box you can point and click with the mouse the variables you want from the dataset. You can shift-click to select a range of variables, you can hold shift and press the down key to select one or more variables, and so on. And then you can press Paste and the command with extracted variable names is pasted into your script editor.

So, finally the core question

  • Is there a simple no frills GUI device that permits the selection of variables from a data.frame (e.g., something like guiselect(df) opens a gui window for variable selection), and returns a vector of variable names selected c("var1", "var2", ...)?
  • Is dput the best general option for selecting a set of variable names in R? Or is there a better way?

Update (April 2017): I have posted my own understanding of a good strategy below.

If you work on Windows with Office, an option would be to save your data.frame to a csv file using write.csv(), open it with Excel, delete the columns you don't want, save, then read it back into R using read.csv(). I hope people come up with better solutions... - flodel
edit(df), delete what you don't want? - blindjesse
@blindJesse I rarely want to actually modify the data.frame in these contexts. In a typical set of analyses I'll be extracting 20 or 30 different subsets of variables for different analyses. It's also important to me that any changes are recorded in my script so that the results are reproducible. - Jeromy Anglim
I love this question! I've just been putting up with typing out the names (didn't even know about the dput thing). It never even occurred to me to hope that there would be an R-ish way of doing it. - Chris Beeley
@JeromyAnglim. How are you approaching this nowadays? It's 2017 and I still have to type var1 + var2 + var3 + .... var158 within regsubsets( ) in R. I still think it's easier to do this type of analysis in JMP and SPSS. - Dan

5 Answers

26
votes

I'm personally a fan of the myvars <- c(...) and then using mydf[,myvars] from there on in.

However this still requires you to enter the initial variable names (even though just once), and as far as I read your question, it is this initial 'picking variable names' that is what you're asking about.

Re a simple no-frills GUI device -- I've recently been introduced to the menu function, which is exactly a simple no-frills GUI device for selecting one object out of a list of choices. Try menu(names(df),graphics=TRUE) to see what I mean (returns the column number). It even gives a nice text interface if for some reason your system can't do the graphics (try with graphics=FALSE to see what I mean).

However this is of limited use to you, as you can only select one column name. To select multiple, you can use select.list (mentioned in ?menu as the alternative to make multiple selections):

# example with iris data (I don't have 'psych' package):
vars <- select.list(names(iris),multiple=TRUE,
                    title='select your variable names',
                    graphics=TRUE)

This also takes a graphics=TRUE option (single click on all the items you want to select). It returns the names of the variables.
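
As a rough sketch of how this could be wrapped into the guiselect(df) helper imagined in the question: the function name and its behaviour are assumptions for illustration, not an existing function. It prints the selection with dput so that it can be pasted into a script, which keeps the analysis reproducible.

```r
# Hypothetical helper: pick columns interactively, print them in c("...") form
guiselect <- function(df, title = "Select variables") {
  # select.list() only works in an interactive session
  if (!interactive()) {
    stop("guiselect() needs an interactive R session")
  }
  chosen <- utils::select.list(names(df), multiple = TRUE, title = title)
  dput(chosen)        # paste this output into your script
  invisible(chosen)   # also return the names for immediate use
}
```

In an interactive session, vars <- guiselect(iris) opens the selection dialog and iris[vars] then gives the subset.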

10
votes

You could use select.list(), like this:

DF <- data.frame(replicate(26,list(rnorm(5))))
names(DF) <- LETTERS
subDF <- DF[select.list(names(DF), multiple=TRUE)]
5
votes

I use the following strategy to make variable selection in R efficient.

Use metadata to store variable names

I have data frames with one row per variable for certain sets of variables. For example, I might have a 100-item personality test. The metadata includes the variable name in R along with all the scoring information (e.g., whether the item should be reversed and so on). I can then extract variable names for the items and the scale names from this metadata.

Store variable sets in a named list

In every project, I have a list called v that stores named sets of variables. Then in any analysis that requires a set of variables, I can just refer to the named list. This also makes code more reliable, because if the variable names change, you update them in one place and all the contingent analyses pick up the change. It is also good for creating consistency in how variables are ordered.

Here's a simple example:

v <- list()
v$neo_items <- meta.neo$id
v$ds14_items <- meta.ds14$id
v$core_items <- c(v$neo_items, v$ds14_items)       

v$typed_scales <- c("na", "si")
v$typed_all <- c("typed_continuous_sum", "na", "si")
v$neo_facets <- sort(unique(meta.neo$facet))
v$neo_factors <- c("agreeableness", "conscientiousness", 
                   "extraversion", "neuroticism", "openness")
v$outcomes_scales <- c("healthbehavior", "socialsupport", 
                "physical_symptoms", "psychological_symptoms")

A few points can be seen from the above example:

  • Often the variable lists will be generated from metadata that I have stored separately. So for example, I have the variable names for the 240 items of the neo personality test stored in meta.neo$id
  • In some cases, variable names can be derived from metadata. For example, one of the columns in my metadata for a personality test indicates which scale the item belongs to, and the variable names are derived from that column by taking the unique values of that column.
  • In some cases, variable sets are the combination of smaller sets. So for example, you might have one set for predictors, one set for outcomes, and one set that combines predictors and outcomes. The division into predictors and outcomes might be useful for some regression models, and the combined set might be useful for a correlation matrix or a factor analysis.
  • For more ad hoc lists of variables, I still use dput(names(df)), where df is my data.frame, to generate the vector of character names that is then stored in a variable list.
  • These variable lists are generally placed after you load your data, but before you munge it. That way, they can be used for data preparation, and they are certainly available when you start running analyses (e.g., predictive models, correlations, descriptive statistics, etc.).
  • The beauty of variable lists is that you can readily use auto-complete in RStudio. So you don't need to remember variable names or even the names of the variable lists. You just type v$ and press tab, or type v$ followed by some part of the list name.

Using variable lists

Using variable lists is fairly straightforward, but some functions in R specify variable names differently.

The simple and standard scenario involves supplying the list of variable names to the data.frame subset. For example,

cor(data[,v$mylist])
cor(data[,v$predictors], data[,v$outcomes])

It is a little bit trickier for functions that require formulas. You may need to build the formula from the variable names. For example:

v <- list()
v$predictors <- c("cyl", "disp")
f <- as.formula(paste("mpg ~", paste(v$predictors, collapse = " + ")))
lm(f, mtcars)
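
As an alternative sketch, base R's reformulate() (in the stats package) builds the same kind of formula from a character vector without manual pasting. The variable list below just repeats the mtcars example.

```r
v <- list()
v$predictors <- c("cyl", "disp")

# reformulate() joins the term labels with "+" and adds the response
f <- reformulate(v$predictors, response = "mpg")
fit <- lm(f, data = mtcars)
f
```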

You can also use variable lists in functions like sapply and lapply (and presumably the tidyverse equivalents). For example, you can create a descriptive statistics table with:

sapply(mydata[, v$outcomes], function(X) c(mean = mean(X), sd = sd(X)))

dput is still useful

For ad hoc variables or even when you are just writing the code to create a variable list, dput is still very useful.

The standard code is dput(names(df)) where df is your data.frame. So for example:

 dput(names(mtcars))

Produces

 c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", 
 "gear", "carb")

You can then edit this vector to extract the variables you need. This has the additional benefit that it reduces typing errors in your code. And this is a really important point. You don't want to spend lots of time trying to debug code when the problem was merely a typo. Furthermore, R's error message when mistyping a variable name is awful. It just says "undefined columns selected". It doesn't tell you which variable names were wrong.
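
One way to find out which names are wrong is to compare the requested names with the actual column names before subsetting. A minimal sketch using mtcars, with two deliberate typos:

```r
myvars <- c("mpg", "cyl", "Disp", "hpp")   # "Disp" and "hpp" are deliberate typos

# Names that were requested but do not exist in the data.frame
missing_vars <- setdiff(myvars, names(mtcars))
missing_vars

# Subset only once every requested name is known to exist
if (length(missing_vars) == 0) {
  newdata <- mtcars[myvars]
}
```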

If you have a large number of variables, you can also use a range of string search functions to extract a subset of the variable names:

For example

> library(psych)
> dput(names(bfi)) #all items
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", 
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1", 
"O2", "O3", "O4", "O5", "gender", "education", "age")
> dput(grep("^..$", names(bfi), value = TRUE)) # two letter variable names
c("A1", "A2", "A3", "A4", "A5", "C1", "C2", "C3", "C4", "C5", 
"E1", "E2", "E3", "E4", "E5", "N1", "N2", "N3", "N4", "N5", "O1", 
"O2", "O3", "O4", "O5")
> dput(grep("^E.$", names(bfi), value = TRUE)) # E items
c("E1", "E2", "E3", "E4", "E5")
> dput(grep(".5$", names(bfi), value = TRUE)) # 5th items
c("A5", "C5", "E5", "N5", "O5")

Clean existing variable names and use a naming convention

When I get a data file from someone else, the variable names often lack conventions or use conventions that make the variables harder to work with in R. A few rules that I use:

  • make all variables lower case (having to think about lower and upper case variables is just annoying)
  • make variable names intrinsically meaningful (some other software uses variable labels to store meaningful data; R doesn't really use labels)
  • Keep variables to an appropriate length (i.e., not too long). Up to 10 characters is fine. More than 20 gets annoying.

All these steps generally make variable selection easier because there are fewer inconsistencies to remember.
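
The lower-casing rule, for instance, can be sketched as follows. The data.frame raw and its column names are made up for illustration.

```r
# Hypothetical inherited data with inconsistent capitalisation
raw <- data.frame(Gender = 1, AGE = 30, eduLevel = 2)

clean <- raw
names(clean) <- tolower(names(clean))

# Lower-casing can create duplicates (e.g. "Age" and "AGE"), so check for them
stopifnot(!anyDuplicated(names(clean)))
names(clean)
```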

Use tab completion for individual variable names

For individual variables, I generally use auto-completion from the data frame. E.g., df$ and press tab.

I try to use a coding style that allows me to use auto-completion as much as possible. I don't like functions that require me to know the variable name without using auto-completion. For example, when subsetting a data.frame, I prefer

df[ df$sample == "control", ]

to

subset(df, sample == "control")

because I can autocomplete the variable name "sample" in the first example, but not in the second.

3
votes

If you want a method that ignores the case of variables and perhaps picks out variables on the basis of their 'stems', then use the appropriate regex pattern with ignore.case=TRUE and value=TRUE in grep:

 dfrm <- data.frame(var1=1, var2=2, var3=3, THIS=4, Dont=5, NOTthis=6, WANTthis=7)
# collect the case-insensitive matches for each stem into one character vector
desired <- unlist(sapply( c("Want", "these", "var"),
   function(x) grep(paste("^", x,sep=""), names(dfrm), ignore.case=TRUE, value=TRUE) ))
desired
#----------------
      Want       var1       var2       var3   # Names of the vector
"WANTthis"     "var1"     "var2"     "var3"   # Values matched
> dfrm[desired]
  WANTthis var1 var2 var3
1        7    1    2    3
0
votes

Do you mean select?

sub_df = subset(df, select=c("v1","v2","v3"))