
I have several big rasters in GeoTIFF format.

Each raster has dimensions 34565 × 116908 (nrow × ncol), i.e. 4,040,925,020 cells, and they overlap perfectly. Values are of different types: integer, float...

How can I write these raster values into a table or database using R (or other software), so that I can later analyze them with Spark, Python, or R?

I will need to process several rasters, so ideally the output table would look like:

row     column     raster1.value  raster2.value  raster3.value
1       1          56             76             100
1       2          18             45             89
...     ...        ...            ...            ...

34565   116908     23             39             43

I have access to a computing facility with 32 cores and 128 GB of RAM, so parallel computing is also possible.

I will be very thankful for your help!

Thanks for your answer! (The comment I replied to was deleted.) I need the data to be in table format because (1) I want to use the data.table package from R with its pass-by-reference semantics (or maybe even Spark); (2) I need to subset records based on a key (indexing), which I cannot do efficiently if the data is in raster format; (3) working with tables, matrices, and databases is much faster than with rasters. About long vs. wide format: I prefer wide format, as I want to do a calculation for each cell using values from all rasters (e.g., raster1.value + raster2.value / raster3.value). – Olga Danylo
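
For illustration, once the wide table exists, the per-cell calculation Olga describes is a one-liner in data.table; the value column names below are assumed:

#Per-cell calculation across rasters, updating big_dt by reference (column names assumed)
big_dt[, result := raster1 + raster2 / raster3]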
I deleted the comment after re-reading your question: I think it's clear as it is. Just one question: are your rasters single- or multi-band? Also, are they named in such a way that their order can be determined from the name alone? Cheers. – shekeine
Your operations would be much more efficient if done in the raster package. I understand your indexing need, but there is capacity to do this in raster as well (e.g., x[i, j] where i = row, j = col). The way you are thinking about the data is very inefficient and would produce an absurdly large array. When you get into data this large, the advantages of using something like data.table are lost. Additionally, you may have plenty of RAM, but you still need to account for operating on the data and not just persisting it. – Jeffrey Evans
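
For reference, the raster-side indexing Jeffrey mentions looks like this (the file path is hypothetical):

library(raster)

#Matrix-style indexing on a raster: read the value at row 100, column 200
r <- raster('myras/raster1.tif')
v <- r[100, 200]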
@JeffreyEvans Wouldn't the raster vs. data.table consideration depend on the nature of the operations? Also, you say "when you get into data this large the advantages of using something like data.table are lost"; I'm hooked, could you explain a bit more, or at least give a reference? Cheers. – shekeine
@shekeine The data will be used later for building a machine learning model, so working with the rasters is not an option; you guessed right. To be more flexible with the tools I want to use, I really need to have the data in one big table. To answer your first comment: all rasters are single-band. So the task is really to rewrite each raster matrix into a column of one data table. – Georg Schnabel

1 Answer

#Load libs
library(raster)
library(doParallel)
library(parallel)
library(foreach)
library(data.table)

Supposing you have n rasters located in myras/:

#List of paths to rasters
raspaths <- list.files('myras', pattern='\\.tif$', full.names=T)

#Register cluster for parallel processing: use all cores except one
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

#Raster to data.table in parallel: one raster per worker
dtlist <- foreach (ras_id=raspaths, .packages=c('raster', 'data.table'), .combine='c') %dopar% {

          #Read one raster
          ras <- raster(ras_id)

          #get column and row indices
          ridx <- rowFromCell(object=ras, cell=1:ncell(ras))
          cidx <- colFromCell(object=ras, cell=1:ncell(ras))

          #Convert to data.frame then to data.table (slowest part, perhaps someone here knows a better way?)
          dt <- data.table(as.data.frame(ras))

          #Add row and column info
          dt[, c('row', 'column'):=list(ridx, cidx)]

          #Set the key on row and column (a bare setkey(dt) before adding them would sort by the value column and break the cell order)
          setkey(dt, row, column)

          #Some column ordering
          setcolorder(x=dt, c("row", "column", names(dt)[1]))
          list(dt)
}
stopCluster(cl)

#Merge all per-raster datatables into one wide table, joining on (row, column)
big_dt <- Reduce(function(x, y) merge(x, y, by=c('row', 'column')), dtlist)

#Write to disk as a comma separated text file, which can then be read into any database, e.g. PostgreSQL
write.csv(x=big_dt, file='mybigtable.csv', row.names=F)
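
If your data.table version is recent enough (1.9.8 or later), fwrite() should be much faster than write.csv() for a table of this size:

#Multi-threaded CSV writer from data.table (v1.9.8+)
fwrite(big_dt, file='mybigtable.csv')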

Writing to CSV as the final step is not the only option: with RPostgreSQL, you can also load "big_dt" straight into a relation in a local PostgreSQL instance.
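
A minimal sketch of that route, assuming a local PostgreSQL server; the connection details and table name below are placeholders:

library(RPostgreSQL)

#Open a connection to the local PostgreSQL instance (credentials are placeholders)
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, dbname='mydb', host='localhost', user='myuser', password='mypassword')

#Write big_dt into a relation; row.names=FALSE avoids an extra row-name column
dbWriteTable(con, name='raster_values', value=big_dt, row.names=FALSE)

#Clean up
dbDisconnect(con)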