I'm trying to parse some data read in from Excel worksheets. Because the data start on non-standard rows, I need to use grep or which to find the index of the starting row, but I'm getting oddly different results depending on whether I use a tibble directly out of read_excel or convert it to a data frame.
I've produced a quick reproducible example:
test_vec<-c("SUMMARY OF PRICE FORECASTS, INFLATION and EXCHANGE RATES ",
"*********************************************************************************************",
"NA ",
"NA ",
"NA ",
"Year ",
"1989 Act ",
"1990 Act")
This is a sample from the first 8 rows of one of the data files, and I need to identify the row which contains "Year".
If the data are stored in a tibble:
library(tibble)
test_df<-tibble(test_vec)
grepl("Year",test_df[,1])
grepl("Year",test_df$test_vec)
Then I get diverging results depending on whether I index the column or use the column name:
> test_df<-tibble(test_vec)
> grepl("Year",test_df[,1])
[1] TRUE
> grepl("Year",test_df$test_vec)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
Now the same thing, but storing the data in a data frame instead:
test_df<-data.frame(test_vec,stringsAsFactors = FALSE)
> grepl("Year",test_df[,1])
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
> grepl("Year",test_df$test_vec)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
And this also holds if I go through a tibble first and then convert to a data frame:
test_df<-data.frame(tibble(test_vec),stringsAsFactors=FALSE)
> grepl("Year",test_df[,1])
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
> grepl("Year",test_df$test_vec)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
Column names will not be constant in the data, so I can work around this by renaming the first column and indexing by that if I need to, like this:
test_df<-tibble(test_vec)
colnames(test_df)[1]<-"fixed_name"
grepl("Year",test_df$fixed_name)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
But I don't understand why [,1] behaves differently on a tibble than on a data frame. Any help would be appreciated.
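One way to see where the single TRUE comes from is to look at what grepl actually receives. The sketch below uses a shortened, hypothetical version of the sample vector and assumes the tibble package is installed. It relies on grepl coercing non-character input with as.character, which turns a one-column tibble (a list of length one) into a single string.

```r
library(tibble)

# Shortened, hypothetical stand-in for the sample data in the question
test_vec <- c("SUMMARY", "NA ", "Year ", "1989 Act ")

tbl <- tibble(test_vec)
df  <- data.frame(test_vec, stringsAsFactors = FALSE)

# [, 1] keeps a one-column tibble, but drops a data frame down to a vector
class(tbl[, 1])
class(df[, 1])

# A one-column tibble is a list of length one, so as.character() collapses
# the whole column into one string; grepl() then tests that single string
collapsed <- as.character(tbl[, 1])
length(collapsed)
grepl("Year", collapsed)
```

Because the collapsed string contains every element of the column, the pattern matches as long as "Year" appears anywhere in it, which is why the result is a single TRUE rather than one value per row.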
With tibbles, if you index with $ you always return a vector; if you index with [ you always return a tibble. This is in contrast to data frames, where [ can sometimes return a vector and sometimes a data frame, as you see here. Print test_df[,1] and test_df$test_vec. See the tibble vignette for more. – Calum You
test_df %>% pull(1)? – Ben Bolker
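Following the comments above, here is a minimal sketch of ways to pull the first column out as a plain vector by position, so that no renaming is needed when column names vary. It assumes the tibble and dplyr packages are installed; the shortened data and the names col_a, col_b, col_c and start_row are only illustrative.

```r
library(tibble)
library(dplyr)  # only needed for pull()

# Shortened, hypothetical stand-in for one worksheet column
test_df <- tibble(test_vec = c("NA ", "Year ", "1989 Act ", "1990 Act"))

# Three ways to extract the first column as a vector, by position
col_a <- test_df[[1]]               # [[ extracts a single column
col_b <- pull(test_df, 1)           # dplyr equivalent
col_c <- test_df[, 1, drop = TRUE]  # ask [ to drop to a vector explicitly

# Index of the row that contains "Year"
start_row <- which(grepl("Year", col_a))
start_row
```

All three extractions should give the same character vector, and [[1]] has the advantage of behaving the same way on a tibble and on a data frame.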