Scenario: You have a CSV file with data in sections, e.g.
[Car data]
mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
14.3,8,360,245,3.21,3.57,15.84,0,0,3,4 ...
[Other stuff]
Forgive the formatting: I had to add extra newlines to get the block quote to at least resemble the intended data format. Below I'll create a reproducible example using mtcars and pretend we've already done the easy bit of subsetting the rows we want, as in the motivating code quoted here:
# Import raw data:
data_raw <- readLines("test.txt")
# find separation line:
id_sep <- which(data_raw=="")
# create ranges of both data sets:
data_1_range <- 4:(id_sep-1)
data_2_range <- (id_sep+4):length(data_raw)
# using ranges and row data import it:
data_1 <- read.csv(textConnection(data_raw[data_1_range]))
data_2 <- read.csv(textConnection(data_raw[data_2_range]))
from this post. In other words, the approach we're looking at is to read the file in once, as lines, find the lines we want, and then parse them with read.csv to get a data.frame.
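For the bracketed layout in the scenario above, the "find the lines we want" step might look something like the sketch below. It is only an illustration under stated assumptions: the file contents, the `path` variable, and the rule that a section header is a whole line of the form `[name]` are all made up for the example, and it uses base R's `read.csv(text = ...)` rather than a `textConnection`.

```r
# Hypothetical sectioned file, written to a temp path so the sketch is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c(
  "[Car data]",
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108",
  "[Other stuff]",
  "x,y",
  "1,2"
), path)

# Read the whole file once, as lines.
data_raw <- readLines(path)

# Assumption: section headers are whole lines of the form "[name]".
header_idx <- grep("^\\[.*\\]$", data_raw)

# Lines of the first section: everything between its header and the next header.
# (This sketch assumes a second header exists.)
section_1 <- data_raw[(header_idx[1] + 1):(header_idx[2] - 1)]

# read.csv() can parse a character vector of lines via its `text` argument.
car_data <- read.csv(text = section_1)
```

Here `car_data` ends up as a data.frame holding only the rows under `[Car data]`; the same indexing could be repeated for later sections.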
Okay, so the year is now 2017 and we want to embrace the tidyverse world and use read_lines in place of readLines, and read_csv in place of read.csv.
library(tidyverse)
write_csv(mtcars, "mtcars_local.csv")
# this creates an easily reproduced local file
data_raw <- readLines("mtcars_local.csv")
# henceforth assume we've found the desired rows and subsetted
data_df <- read.csv(textConnection(data_raw))
head(data_df)
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# whoo hoo, the above is exactly the output we want (replicating
# the original post answer)
data_raw_2 <- read_lines("mtcars_local.csv")
data_df_2 <- read_csv(textConnection(data_raw_2))
#Error in read_connection_(con) :
# Evaluation error: can only read from a binary connection.
So read_csv doesn't accept a textConnection the way read.csv does. The documentation for read_csv does say:
Arguments:
file: Either a path to a file, a connection, or literal data (either a single string or a raw vector).
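Going by the "literal data (either a single string ...)" wording quoted above, one thing that may be worth trying is collapsing the lines into a single newline-separated string before handing them to read_csv. The sketch below is a guess at how that could look, not a confirmed recommendation: the three-column `data_raw_2` vector is a stand-in for the already-subsetted lines, and `csv_text` is a name introduced just for this example.

```r
library(readr)

# Stand-in for the already-subsetted lines (header row plus data rows).
data_raw_2 <- c(
  "mpg,cyl,disp",
  "21,6,160",
  "22.8,4,108"
)

# Join the lines into one string. The embedded newlines are what let
# read_csv() treat the input as literal data rather than as a file path.
csv_text <- paste0(paste(data_raw_2, collapse = "\n"), "\n")

data_df_2 <- read_csv(csv_text)
```

The trailing `"\n"` is there so that even a header-only input still contains a newline and is not mistaken for a path.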
So, question(s):
- Is there a neat tidyverse way of getting a particular delimited section of a CSV into a tibble (one that doesn't involve reading in the lines and subsetting as an interim step)?
- Or, given such a vector of strings (one per line), how can you get them into a tibble?