0
votes

I'm trying to parse some data read in from Excel worksheets and, because they are in non-standard rows, I need to use grep or which to find the index of the starting row for my data but I'm getting oddly different results depending on whether I use a tibble directly out of read_excel or convert to a data frame.

I've produced a quick reproducible example:

test_vec<-c("SUMMARY OF PRICE FORECASTS, INFLATION and EXCHANGE RATES                                     ",
"*********************************************************************************************",
"NA                                                                                           ",
"NA                                                                                           ",
"NA                                                                                           ",
"Year                                                                                         ",
"1989 Act                                                                                     ",
"1990 Act") 

This is a sample from the first 7 rows of one of the data files, and I need to identify the row which contains "Year".

If the data are stored in a tibble:

test_df<-tibble(test_vec)
grepl("Year",test_df[,1]) 
grepl("Year",test_df$test_vec) 

Then I get diverging results depending on whether I index the column or use the column name:

> test_df<-tibble(test_vec)
>    grepl("Year",test_df[,1]) 
[1] TRUE
>    grepl("Year",test_df$test_vec) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Now, the same thing with an addition to convert to data frame:

test_df<-data.frame(test_vec,stringsAsFactors = F)
>    grepl("Year",test_df[,1]) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
>    grepl("Year",test_df$test_vec) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

And this holds if I run through tibble and then data frame too.

test_df<-data.frame(tibble(test_vec),stringsAsFactors=F)
>    grepl("Year",test_df[,1]) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
>    grepl("Year",test_df$test_vec) 
[1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

Column names will not be constant in the data, so I can work around this by renaming the first column and indexing by that if I need to, like this:

test_df<-tibble(test_vec)
colnames(test_df)[1]<-"fixed_name"
grepl("Year",test_df$fixed_name) 
> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

But, I guess I don't understand why the [,1] operates differently in the tibble than in the data frame. Any help would be appreciated.

1
This is a key feature of tibbles. If you index with $ you always return a vector, if you index with [ you always return a tibble. This is in contrast to data frames where [ can sometimes return a vector, sometimes a data frame as you see here. Print test_df[,1] and test_df$test_vec. See the tibble vignette for more.Calum You
@CalumYou, please post this as an answer ...Ben Bolker
Thanks, @CalumYou! That gets me halfway there. Is there a less clumsy work-around to do the equivalent of test_df$1 to get the vector return of the first column from a tibble?Andrew Leach
test_df %>% pull(1) ?Ben Bolker
@BenBolker yeah, grepl("Year",test_df %>% pull(1)) works.Andrew Leach

1 Answers

1
votes

This is a key feature of tibbles, as described in the tibble vignette. Let's illustrate:

library(tibble)
test_vec <- c("SUMMARY OF PRICE FORECASTS, INFLATION and EXCHANGE RATES                                     ","*********************************************************************************************","NA                                                                                           ","NA                                                                                           ","NA                                                                                           ","Year                                                                                         ","1989 Act                                                                                     ","1990 Act") 
test_tbl <- tibble(test_vec)
test_df <- data.frame(test_vec, stringsAsFactors = F)

If you index with $ you always return a vector for both tibbles and data.frames:

class(test_tbl$test_vec)
#> [1] "character"
class(test_df$test_vec)
#> [1] "character"

But if you index with [, a tibble always returns a tibble whereas a data.frame can return a vector. Specifically, it simplifies a one-column output to a vector.

class(test_tbl[, 1])
#> [1] "tbl_df"     "tbl"        "data.frame"
class(test_df[, 1])
#> [1] "character"

If you only know column indexes and not the names, and you know you only want to return one column, you can use [[ to return a vector from both dataframes and tibbles. If you are using dplyr, pull is the same as [[ for local data.

class(test_tbl[[1]])
#> [1] "character"
class(test_df[[1]])
#> [1] "character"
class(dplyr::pull(test_tbl, 1))
#> [1] "character"

Created on 2019-08-09 by the reprex package (v0.3.0)