419
votes

I have an R data frame with 6 columns, and I want to create a new data frame that has only three of the columns.

Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:

 data.frame(df$A,df$B,df$E)

Is there a more compact way of doing this?

10 Answers

209
votes

Using the dplyr package, if your data.frame is called df1:

library(dplyr)

df1 %>%
  select(A, B, E)

This can also be written without the %>% pipe as:

select(df1, A, B, E)
476
votes

You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.

# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]

Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame, whereas df["A"] always returns a data frame.

str(df["A"])
## 'data.frame':    1 obs. of  1 variable:
## $ A: int 1
str(df[,"A"])  # vector
##  int 1
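As a rough sketch of why a vector of names is convenient inside functions: the helper below takes the column names as an ordinary argument and passes them straight to single-bracket indexing. The names keep_columns, example_df, and wanted are made up for illustration and are not part of base R or the answer above.

```r
# Hypothetical helper: keep only the named columns, always as a data frame
keep_columns <- function(data, cols) {
  data[cols]  # no comma, so even a single column stays a data frame
}

# Same one-row example data as above, under a different name
example_df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# The names live in a variable, so they can be computed or passed around
wanted <- c("A", "B", "E")
result <- keep_columns(example_df, wanted)
names(result)
```

Because cols is just a character vector, the same function works whether the names are typed in, read from a config file, or built programmatically.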

Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).

# subset (original solution--not recommended)
df[,c("A","B","E")]  # returns a data.frame
df[,"A"]             # returns a vector
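
If you do want to keep the comma form, base R's drop argument is one way to stop a single column from collapsing to a vector. A minimal sketch, using an illustrative example_df rather than the df above:

```r
# Illustrative data: one row, columns A to E
example_df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# drop = FALSE keeps the data frame structure even for one column
one_col <- example_df[, "A", drop = FALSE]
is.data.frame(one_col)

# It is harmless when several columns are selected
three_cols <- example_df[, c("A", "B", "E"), drop = FALSE]
names(three_cols)
```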
107
votes

This is the role of the subset() function:

> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> subset(dat, select=c("A", "B"))
  A B
1 1 3
2 2 4
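
For interactive use, the select argument of subset() also accepts unquoted names, ranges, and negative selections. The sketch below reuses the dat example; the result names (by_name, by_range, by_drop) are illustrative only.

```r
dat <- data.frame(A = c(1, 2), B = c(3, 4), C = c(5, 6),
                  D = c(7, 7), E = c(8, 8), F = c(9, 9))

by_name  <- subset(dat, select = c(A, B, E))   # unquoted column names
by_range <- subset(dat, select = A:C)          # consecutive columns A through C
by_drop  <- subset(dat, select = -c(C, D, F))  # everything except C, D, and F

names(by_name)
names(by_range)
names(by_drop)
```

This relies on non-standard evaluation, so it is best suited to scripts and the console rather than code inside functions.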
82
votes

There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or

df[,c(1,2,5)]

as in

> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9)) 
> df
  A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
  A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
  A B E
1 1 3 8
2 2 4 8
19
votes

For some reason only

df[, (names(df) %in% c("A","B","E"))]

worked for me. All of the above syntaxes yielded "undefined columns selected".
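
One possible explanation (a guess, since the original data is not shown): base R raises "undefined columns selected" when a requested name does not exactly match names(df), for example because of a typo, different letter case, or stray whitespace. The %in% form never raises that error; it quietly keeps whichever names do match. The sketch below uses made-up data with a deliberately mismatched column name.

```r
# Illustrative data: note the lower-case "e"
example_df <- data.frame(A = 1:2, B = 3:4, e = 5:6)
wanted <- c("A", "B", "E")

# Name-based indexing fails because "E" is not a column;
# capture the error message instead of stopping
err <- tryCatch(example_df[, wanted], error = function(e) conditionMessage(e))

# The %in% form returns only the columns that actually exist
kept <- example_df[, names(example_df) %in% wanted]
names(kept)

# Listing the unmatched names shows what went wrong
setdiff(wanted, names(example_df))
```

If setdiff() reports any names, the %in% result is silently missing those columns, so it is worth checking before relying on it.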

19
votes

Where df1 is your original data frame:

df2 <- subset(df1, select = c(1, 2, 5))
15
votes

You can also use the sqldf package, which runs SQL SELECT statements against R data frames:

library(sqldf)

df1 <- sqldf("select A, B, E from df")

This returns a data frame df1 with columns A, B, and E.

4
votes

You can use with():

with(df, data.frame(A, B, E))
0
votes

[ and subset() are not interchangeable: with the comma form, [ returns a vector when only one column is selected, whereas subset() returns a data frame.

df = data.frame(a="a",b="b")    

identical(
  df[,c("a")], 
  subset(df,select="a")
) 

identical(
  df[,c("a","b")],  
  subset(df,select=c("a","b"))
)
0
votes

df <- dplyr::select(df, A, B, C)

You can also assign the result to a different name:

data <- dplyr::select(df, A, B, C)