2
votes

I am trying to solve a problem of importing xls data into R with readxl package. The specific xls file has 18 columns and 472 rows, first 7 rows have descriptive text that needs to be skipped. I only want to select col 1,3,6:9 out of the 18 columns for EDA. They have mixed types including date, numeric and text.

The readxl seems not able to import non-continous columns directly. My plan is to use skip =7 to read the entire sheet first and use select next step. However, the problem is readxl guess the date type to numeric by default. Is there a way in readxl to specify col_types by column name?

A reproducible code with example xlsx for a work around demostration.

    library(readxl)

    xlsx_example <- readxl_example("datasets.xlsx")

    # read the entire table
    read_excel(xlsx_example)

    # select specific column to name - following code does not work

    read_excel(xlsx_example, col_types=col (Sepal.Length = "numeric"))
2
@markdly You could write that up as an answer.G5W
@G5W, Thanks. Have posted previous comment as possiblle answer (I wan't sure it was complete enough to do so).markdly
@markdly, thanks. This would be an answer for excelsheet with small number of columns. To select 6 out of 18 cols, we need to put in 18 types into col_types = col (c('18 types')).Richard Yang
@RichardYang, the other way you could do it is read in all columns with col_types = "text". This will set all columns to text by default. From there you can select the relevant columns keep and convert each column to an appropriate type after import.markdly
@RichardYang see edited answer for an examplemarkdly

2 Answers

2
votes

As far as I'm aware, you are not able to specify col_types by column name. It's possible to only read in specific columns though. For example,

read_excel(xlsx_example, col_types=c("numeric", "skip", "numeric", "numeric", "skip"))

will import columns 1, 3 and 4 and skip columns 2 and 5. You could do this for the 18 columns but I think this gets a bit hard to keep track of which column is being imported as which type.

An alternative is to read in all columns as text using col_types = "text" then select and convert variables by name. For example:

library(tidyverse)
library(readxl)
xlsx_example <- readxl_example("datasets.xlsx")
df <- read_excel(xlsx_example, col_types = "text")
df %>% 
  select(Sepal.Length, Petal.Length) %>% 
  mutate(Sepal.Length = as.numeric(Sepal.Length))
#> # A tibble: 150 x 2
#>    Sepal.Length Petal.Length
#>           <dbl>        <chr>
#>  1          5.1          1.4
#>  2          4.9          1.4
#>  3          4.7          1.3
#>  4          4.6          1.5
#>  5          5.0          1.4
#>  6          5.4          1.7
#>  7          4.6          1.4
#>  8          5.0          1.5
#>  9          4.4          1.4
#> 10          4.9          1.5
#> # ... with 140 more rows
-2
votes

So I think you can do:

read_excel(xlsx_example, col_types=col (Sepal.Length = col_numeric()))