0
votes

I am trying to utilize sqldf for my large dataset. (Original data files are scattered across 1000~ separate csv files, so I've created a master csv file that contains all those information and plan to use sqldf to fetch the data as needed)

Upon trying some simple codes, none of my sql queries was returning a valid result -- all queries will return 0 observations!

I tried to run the sample code provided in "read.csv.sql" documentation, which creates a csv file with iris dataset and runs a simple sql query to fetch the data from the created csv file. I've done it in two different versions:

write.csv(iris, "iris1.csv", quote = FALSE, row.names = FALSE)
iris1 <- read.csv.sql("iris1.csv", 
                      sql = "select * from file where Species = 'setosa' ")
dim(iris1)
write_csv(iris, "iris2.csv")
iris2 <- read.csv.sql("iris2.csv", 
                      sql = "select * from file where Species = 'setosa' ")
dim(iris2)

I get following:

> dim(iris1)
[1] 50  5
> dim(iris2)
[1] 0 5

The only difference I am finding is that "iris1.csv" was made with write.csv, while "iris2.csv" was made with write_csv. To me, they appear to be the exact same csv files. What underlying difference would be the cause for this unexpected result? I would just use write.csv, but my dataset is so large that I definitely need to use write_csv to create a master db.

1
I think the problem is CSV encoding format. According to write_csv manual, the output (iris2) should be encoded in UTF-8 (as in write.csvfor iris1) but I suspect it is CSV-Macintosh which is not the best option for read.csv.sql.Majid

1 Answers

3
votes

The question did not mention platform. I can't reproduce this on Linux but can reproduce the problem on Windows. The issue on Windows is line endings. write_csv writes \n as line endings but read.csv.sql defaults to the usual line endings on the platform on which it is running -- on Windows the usual line endings are \r\n. Both write.csv and read.csv.sql have an eol argument that allows one to specify the end of line character(s) but write_csv does not (see readr issue #857) so try specifying eol = "\n" in read.csv.sql like this:

write_csv(iris, "iris2.csv")

iris2a <- read.csv.sql("iris2.csv", 
  sql = "select * from file where Species = 'setosa' ", eol = "\n")

dim(iris2a)
## [1] 50  5