R: read.csv.sql from sqldf is able to successfully read one csv but not another

Question

I have a dataset which is around 20GB big and therefore, I am not able to read it into an R dataframe without running out of memory. After reading some posts here, I have decided to use read.csv.sql into a database. The code I used is:

read.csv.sql("jobs.csv", sql = "CREATE TABLE Jobs2 AS SELECT * FROM file", dbname = "Test1.sqlite")

When I run the following:

sqldf("select * from Jobs2",dbname="Test1.sqlite")

I get the heading of the columns, but then no value: <0 rows> (or 0-length row.names)

But when I try the same with a csv I created using the iris dataset, everything works fine.

What am I missing here?

Thanks in advance.

G. Grothendieck G. Grothendieck · Accepted Answer · 2015-04-06T16:33:42

sqldf is primarily intended to process data frames so it creates databases and database tables transparently and removes them after completing the sql. Thus your first statement would not be expected to work since sqldf would remove the database after the statement completes.

If the SQL creates the database or table rather than sqldf itself then sqldf won't know about it so it won't delete it. Here we create the database using attach and the table using create table to fool sqldf. In the last line it won't remove the database oir table because they were already there before that line started and it never removes objects it did not create:

library(sqldf)

read.csv.sql("jobs.csv", sql = c("attach 'test1.sqlite' as new", 
              "create table new.jobs2 as select * from file"))
sqldf("select * from jobs2", dbname = "test1.sqlite")

The other thing that might go wrong would be line endings. Typically sqldf can figure it out but if not you may have to specify the eol character. The need to specify it might occur, for example, if you were trying to read a file created on one operating system in another operating system. See FAQ 11. Why am I having difficulty reading a data file using SQLite in the sqldf README.

Note: read.csv.sql is normally used to just read in a portion of the data. For example, this skips the first 100 rows and then reads columns a and b from the next 1000 rows but the query can be arbitrarily complex since you have all of SQLite's SQL to use:

read.csv.sql("jobs.csv", sql = "select a, b from file limit 1000 offset 100")

The entire file is read into a temporary sqlite database but only the requested portion is ever read into R so the entire file could be larger than R can handle.

Typically if one is trying to achieve persistence one uses RSQLite directly rather than sqldf.

R: read.csv.sql from sqldf is able to successfully read one csv but not another

1 Answers