In general, R is really not meant for in-place file editing, and I'm aware of no (currently available) tools that support it in any context. Even unix-y tools like sed edit quickly but do not technically do it "in-place": they write a modified copy and swap it in, even if they hide that from you. (There might be some tools that do true in-place edits, but perhaps not with the ease-of-access that you want.)
There is one notable exception: a file format that is designed for in-place editing (well, interaction). It supports in-place addition, filtering, replacement, and deletion, and for the most part it does this without needing to grow the file while doing so. It's SQLite.
For example,
library(DBI)
# library(RSQLite) # don't need to load it, just need to have it available
fname <- "./iris.sqlite3"
con <- dbConnect(RSQLite::SQLite(), fname)
file.info(fname)$size
# [1] 0
dbWriteTable(con, "iris", iris)
# [1] TRUE
file.info(fname)$size
# [1] 16384
dbGetQuery(con, "select * from iris where [Sepal.Length]=4.7 and [Sepal.Width]=3.2 and [Petal.Length]=1.6")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.7 3.2 1.6 0.2 setosa
file.info(fname)$size
# [1] 16384
dbExecute(con, "update iris set [Species]='virginica' where [Sepal.Length]=4.7 and [Sepal.Width]=3.2 and [Petal.Length]=1.6")
# [1] 1
dbGetQuery(con, "select * from iris where [Sepal.Length]=4.7 and [Sepal.Width]=3.2 and [Petal.Length]=1.6")
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 4.7 3.2 1.6 0.2 virginica
dbDisconnect(con)
file.info(fname)$size
# [1] 16384
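The same in-place behavior extends to adding and deleting rows. A minimal sketch, continuing the session above (DBI already loaded; this assumes the original 150 rows received rowids 1 through 150, and the file size shown is illustrative):
con <- dbConnect(RSQLite::SQLite(), fname)
dbAppendTable(con, "iris", head(iris, 3))             # add three rows in place
dbExecute(con, "delete from iris where rowid > 150")  # remove the rows just added
# [1] 3
dbDisconnect(con)
file.info(fname)$size
# [1] 16384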
Pros
- It is cross-platform, and far more widespread than most people realize: it is a required internal component of the Firefox browser and the Android OS. (Many others, too.)
- Additionally, drivers exist for most programming languages, including R, Python, Ruby, and far too many to list here.
- There are really no practical limits to the amount of data that can be stored in a single SQLite file. It theoretically supports up to 140TB (https://www.sqlite.org/whentouse.html), though if you're getting that large, there are many (justified) arguments for different solutions.
- Pulling data is built on the SQL standard, and though SQLite is not 100% compliant, it is pretty darn close. Query performance depends on the query and the size of the data, but it is generally quite fast (ref: Will SQLite performance degrade if the database size is greater than 2 gigabytes?)
- In fact, it can be faster than individual file operations.
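For instance, a rough (and not rigorous) comparison, assuming the iris data has also been saved locally as iris.csv: pulling one row from the SQLite file reads only what is needed, whereas the CSV must be read in full and then filtered. With data this small both are effectively instant; the gap appears as the data grows.
write.csv(iris, "iris.csv", row.names = FALSE)
system.time({
  # read just the matching row from the database file
  con <- dbConnect(RSQLite::SQLite(), "iris.sqlite3")
  row1 <- dbGetQuery(con, "select * from iris where [Sepal.Length]=4.7 limit 1")
  dbDisconnect(con)
})
system.time({
  # read the whole CSV, then filter in R
  alliris <- read.csv("iris.csv", stringsAsFactors = FALSE)
  row1 <- alliris[alliris$Sepal.Length == 4.7, ][1, , drop = FALSE]
})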
Cons
- There will be "overhead" in the file size. Notably, iris takes under 7K of memory (see object.size(iris)) yet the file size starts at 16K. With larger data, the ratio of file size to actual data shrinks. (I did the same thing with ggplot2::diamonds; the object is 3456376 bytes and the file size is 3780608 bytes, less than 10% larger.)
- The file size will increase when SQLite deems necessary. This is based on many factors outside the scope of R and this question/answer.
- If you delete a lot of data, the file size does not immediately shrink to reflect that ... see change sqlite file size after "DELETE FROM table" (hint: vacuum; a sketch follows this list).
- There are many tools that will easily/immediately import data from this file format, but notably absent are Excel and Access. It is feasible with SQLite-ODBC, but it takes a little elbow grease. (I'm fine with it, but not all users will be, and some enterprise networks make this step difficult or disallow it outright.)
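To illustrate the vacuum point, a minimal sketch; the file name (big.sqlite3), table (sometable), and the behavior of the sizes here are hypothetical:
con <- dbConnect(RSQLite::SQLite(), "big.sqlite3")  # hypothetical large file
dbExecute(con, "delete from sometable")             # hypothetical table
file.info("big.sqlite3")$size  # roughly unchanged: pages are only marked free
dbExecute(con, "vacuum")       # rebuild the file, returning free pages to the OS
file.info("big.sqlite3")$size  # now smaller
dbDisconnect(con)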
SQLite-file-as-CSV
You can treat it like a file when importing, if you want to import all of it:
con <- dbConnect(RSQLite::SQLite(), fname)
iris2 <- dbGetQuery(con, "select * from iris")
dbDisconnect(con)
as compared to
iris2 <- read.csv("iris.csv", stringsAsFactors = FALSE)
If you want to get fancy:
import_sqlite <- function(fname, tablename = NA) {
  if (length(tablename) > 1L) {
    warning("'tablename' has length > 1 and only the first element will be used")
    tablename <- tablename[[1L]]
  }
  con <- DBI::dbConnect(RSQLite::SQLite(), fname)
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  available_tables <- DBI::dbListTables(con)
  if (length(available_tables) == 0L) {
    stop("no tables found")
  } else if (is.na(tablename)) {
    if (length(available_tables) == 1L) {
      # unambiguous: default to the only table in the file
      tablename <- available_tables
    } else {
      stop("multiple tables found, please choose one of: ",
           paste(sQuote(available_tables), collapse = ", "))
    }
  }
  if (! tablename %in% available_tables) {
    stop(sprintf("table %s not found", sQuote(tablename)))
  }
  # quote the identifier so odd table names (e.g., with dots) are safe
  qry <- sprintf("select * from %s", DBI::dbQuoteIdentifier(con, tablename))
  out <- tryCatch(
    list(data = DBI::dbGetQuery(con, qry), err = NULL),
    error = function(e) list(data = NULL, err = e))
  if (! is.null(out$err)) {
    stop("[sqlite error] ", conditionMessage(out$err))
  }
  out$data
}
head(import_sqlite("iris.sqlite3"))
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
(I'm not offering that function as anything other than a proof-of-concept that you can interact with the single file as if it were a CSV. There are some safeguards in there, but it is really just a hack for the sake of this question.)