Suppose you have a data.frame like this:
x <- data.frame(v1=1:20,v2=1:20,v3=1:20,v4=letters[1:20])
How would you select only those columns in x that are numeric?
EDIT: updated to avoid use of ill-advised sapply.
Since a data frame is a list we can use the list-apply functions:
nums <- unlist(lapply(x, is.numeric))
Then use standard subsetting:
x[ , nums]
## don't use sapply, even though it's less code
## nums <- sapply(x, is.numeric)
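One reason sapply is discouraged here is that its return type depends on its input: for a data frame with no columns it returns an empty list rather than a logical vector. A minimal sketch of a type-stable alternative using base R's vapply, reusing x from the question (the name `empty` is only for illustration):

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])

# vapply() returns a logical vector here, or stops with an error
nums <- vapply(x, is.numeric, logical(1))
x_num <- x[, nums, drop = FALSE]  # drop = FALSE keeps a data frame even for one column

# Edge case: a data frame with zero columns
empty <- data.frame()
class(sapply(empty, is.numeric))              # "list"
class(vapply(empty, is.numeric, logical(1)))  # "logical"
```

The `drop = FALSE` argument matters when only one column is numeric; without it, the result would be simplified to a plain vector.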
For more idiomatic modern R, I'd now recommend:
x[ , purrr::map_lgl(x, is.numeric)]
Less code, less dependent on R's particular quirks, more straightforward, and robust to use on database-backed tibbles:
dplyr::select_if(x, is.numeric)
Newer versions of dplyr (1.0.0 and later) also support the following syntax:
x %>% dplyr::select(where(is.numeric))
Filter() from the base package is the perfect function for that use case. You simply have to write:
Filter(is.numeric, x)
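This works because a data frame is a list of columns: Filter() keeps the elements for which the predicate returns TRUE, and the result is still a data frame. A short sketch using x from the question (the names `num_x` and `other_x` are just for illustration):

```r
x <- data.frame(v1 = 1:20, v2 = 1:20, v3 = 1:20, v4 = letters[1:20])

# Keep only the numeric columns
num_x <- Filter(is.numeric, x)

# Negate() flips the predicate, keeping the non-numeric columns instead
other_x <- Filter(Negate(is.numeric), x)

names(num_x)    # "v1" "v2" "v3"
names(other_x)  # "v4"
```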
It is also much faster than select_if():
library(microbenchmark)
microbenchmark(
dplyr::select_if(mtcars, is.numeric),
Filter(is.numeric, mtcars)
)
This returns (on my computer) a median of 60 microseconds for Filter and 21,000 microseconds for select_if (350x faster).
The PCAmixdata package has a function, splitmix, that splits a given data frame ("YourDataframe") into its quantitative (numeric) and qualitative (categorical) columns, as shown below:
install.packages("PCAmixdata")
library(PCAmixdata)
split <- splitmix(YourDataframe)
X1 <- split$X.quanti  # numeric columns in the dataset
X2 <- split$X.quali   # categorical columns in the dataset
This doesn't directly answer the question but can be very useful, especially if you want something like all the numeric columns except for your id column and dependent variable.
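library(magrittr)  # provides %>% and %<>%
# names of the numeric columns, excluding the id and dependent variable
numeric_cols <- sapply(dataframe, is.numeric) %>% which %>%
  names %>% setdiff(., c("id_variable", "dep_var"))
# apply your_function to those columns and assign the result back
dataframe %<>% dplyr::mutate_at(numeric_cols, function(x) your_function(x))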
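For comparison, here is a rough base R sketch of the same idea with no package dependencies. The data and the body of `your_function` are made up for illustration; only the placeholder names `id_variable`, `dep_var`, and `your_function` come from the snippet above.

```r
# Hypothetical data: column names mirror the placeholders used above
dataframe <- data.frame(
  id_variable = 1:5,
  dep_var     = c(2.1, 3.4, 1.9, 4.0, 2.8),
  height      = c(150, 160, 170, 180, 190),
  group       = letters[1:5]
)

# Stand-in for your_function(): center a column on its mean
your_function <- function(col) col - mean(col)

# Names of numeric columns, minus the id and the dependent variable
is_num <- vapply(dataframe, is.numeric, logical(1))
numeric_cols <- setdiff(names(dataframe)[is_num], c("id_variable", "dep_var"))

# Transform only those columns in place
dataframe[numeric_cols] <- lapply(dataframe[numeric_cols], your_function)
```

Here only `height` is transformed; `id_variable`, `dep_var`, and the character column `group` are left untouched.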