
I have faced several challenges when dealing with a very large RasterStack object in R. Here is the main story: I downloaded gridded data from the European Climate Assessment website (download site of gridded data and download link of gridded data that I am interested in). My very first step was to import this data into R as a RasterStack object. Then I intended to crop the raster grid to particular countries only, so I used raster::crop for that. My ultimate goal is to calculate the annual average temperature for each grid cell. Here is the grid coverage that I cropped from the original raw RasterStack object, where the grid resolution is 0.25 degrees:

[figure: cropped 0.25-degree grid coverage over Germany]

Here is the R script that I attempted:

library(raster)
library(ncdf4)
library(R.utils)
library(maptools)

raw_netCDF = raster::stack("~/tg_0.25deg_reg_1995-2010_v17.0.nc")  # read the downloaded gridded data into R
data(wrld_simpl)                                           # country polygons shipped with maptools
Germany <- wrld_simpl[wrld_simpl@data$NAME == "Germany",]  # select the Germany polygon
deu_ext <- extent(Germany)                                 # bounding box of Germany
Germany_ <- crop(raw_netCDF, deu_ext)                      # crop the stack to that extent

But the cropped result Germany_ raised a challenge. The first challenge is treating missing values in the large RasterStack object. If I didn't treat them, all missing values were turned into zero in the newly produced cropped raster grid, which is confusing because zero also reads as a valid temperature observation of zero degrees Celsius. So I treated the missing values in the large RasterStack object in two different ways. The first one is below:

raw_netCDF_ = raster::reclassify(raw_netCDF , cbind(NA, -999))

but raster::reclassify always failed because of a memory problem, so it is not a good solution. I also tried raster::calc to treat the missing values in the very large RasterStack object, but it is extremely slow even when I run the same operation on a powerful computer, so using raster::calc for this is not a good idea either. Here is the R script:

raw_netCDF_  = raster::calc(raw_netCDF , function(x) { ifelse(is.na(x), -999, x) })

I want to do simple statistics: calculate an annual mean temperature for each grid cell over the whole grid coverage shown above, and produce the output as clean, simple plain-text data. The final plain-text raster grid data should only contain the grid coordinates and each cell's annual mean temperature. Doing such an operation on a RasterStack object is not an ordinary task for me.

Perhaps there is an optimal solution for properly manipulating a very large RasterStack object that makes sure all missing values in the original raw data are correctly preserved in the cropped raster grid of Germany.

Desired output:

In the exported plain-text data, I want the annual mean temperature for the whole Germany grid for the 16 years, something like this:

> ann_mean_temp_1996_1999
        long    lat net_1995_temp net_1996_temp net_1997_temp net_1998_temp net_1999_temp net_2000_temp
   1:  6.125 47.375      84.4         86.4         83.4         81.4         80.4         87.4
   2:  6.375 47.375      89.3         88.3         84.3         81.3         846.3         846.3
   3:  6.625 47.375      80.0         85.0         80.0         83.0         88.0         87.0
   4:  6.875 47.375      84.4         83.4         85.4         86.4         82.4         80.4
   5:  7.125 47.375      83.0         85.0         84.0         89.0         83.0         84.0
  ---                                                                                               
1112: 13.875 54.875      63.8         68.8         66.8         67.8         65.8         66.8
1113: 14.125 54.875      69.6         65.6         61.6         60.6         62.6         63.6
1114: 14.375 54.875      60.5         61.5         62.5         67.5         69.5         64.5
1115: 14.625 54.875      62.9         67.9         68.9         67.9         64.9         68.9
1116: 14.875 54.875      64.6         67.6         66.6         62.8         64.6         63.5

If manipulating a very large RasterStack object in R is possible, how can I obtain the expected raster grid data at the correct resolution (with missing values properly treated) and apply simple statistics to all daily temperature observations for each grid cell? Is it possible to manipulate a RasterStack object and write all raster grid data to plain text (ASCII or CSV) in R? Is there an efficient way to get this done? Any thoughts? Thanks.


1 Answer


I would object to your notion that this is a "very large" RasterStack, but besides that, I think what you want to do should be straightforward.

So first I load and crop the data to the extent of Germany:

library(raster)
library(ncdf4)
library(R.utils)
library(maptools)



r <- stack('tg_0.25deg_reg_1995-2010_v17.0.nc')

data(wrld_simpl) 

Germany <- wrld_simpl[wrld_simpl@data$NAME == "Germany",]

r_crop <- crop(r,Germany)

#Let's take a look:

plot(r_crop[[1]])
plot(Germany,add=T)

The boundary shape isn't particularly pretty, but it does the job. Also, you can see that in the north, the values with NoData are still properly indicated as such:

r_crop[[1]][1,1]
# NA

[figure: plot of r_crop[[1]] with the Germany boundary overlaid]

In the next steps, I just use the layer names to extract the years, and then use lapply to calculate the means for each year:

nms <- names(r_crop)
yrs <- unique(sub('X(\\d+).+','\\1',nms))

yrs[1]
# [1] "1995"

annual_means <- lapply(yrs,function(x) mean(r_crop[[grep(x,nms)]],na.rm=TRUE))

This will give you a list called annual_means with one raster per year, representing the annual mean for that year. Now you can either stack them back together (with do.call(stack, annual_means)), process them individually, or, as you probably want to do, write them to disk as CSV:

# first take a look

plot(annual_means[[1]])

[figure: plot of annual_means[[1]], the annual mean for 1995]

# write to disk

write.table(as.matrix(annual_means[[1]]),'ANNUAL_MEAN_TEMP_1995.csv',quote = F,row.names = F,col.names = F,sep = ';')
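Alternatively, if you would rather keep working with raster objects than with text files, a minimal sketch for stacking the annual means back together (as mentioned above) could look like this:

# combine the single-layer annual means into one RasterStack
annual_stack <- do.call(stack, annual_means)
names(annual_stack) <- yrs   # label each layer with its year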

Edit:

annual_means is a list with a raster per element representing the mean temperature calculated from the daily observations of the original dataset. So the list will have as many elements as there are years.

The write.table example above was only shown for one of these years, meaning that if this is the output you would like, you would need to repeat the step for all elements of the list (see the sketch below).

What the write.table step does is simply convert the raster to a matrix and write it to disk. The result will be a matrix with as many rows and columns as the raster itself, with cells separated by a semicolon (my personal preference).
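A minimal sketch of that loop could look like this (assuming annual_means and yrs as defined above; the file naming pattern is just an example):

# convert each annual mean raster to a matrix and write it to its own
# semicolon-separated text file, one file per year
for (i in seq_along(annual_means)) {
  write.table(as.matrix(annual_means[[i]]),
              paste0('ANNUAL_MEAN_TEMP_', yrs[i], '.csv'),
              quote = FALSE, row.names = FALSE, col.names = FALSE, sep = ';')
}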


Edit2:

Just to illustrate my comments from above:

You have 16 years of data, as seen in the yrs vector:

yrs
 #[1] "1995" "1996" "1997" "1998" "1999" "2000" "2001" "2002" "2003" "2004"
#[11] "2005" "2006" "2007" "2008" "2009" "2010"

Now, annual_means is a list of length 16, with a single layer raster per year, which is the mean for the entire year calculated for the whole of Germany from the daily data.

Here's an example output:

annual_means[[1]]
# class       : RasterLayer 
# dimensions  : 31, 37, 1147  (nrow, ncol, ncell)
# resolution  : 0.25, 0.25  (x, y)
# extent      : 5.75, 15, 47.25, 55  (xmin, xmax, ymin, ymax)
# coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
# data source : in memory
# names       : layer 
# values      : 3.329288, 11.32734  (min, max)

As you can see, the raster has a resolution of 0.25 degrees (the original resolution of your data), which results in a raster with 31 rows and 37 columns covering Germany.

To get your desired output:

I'll first name the list entries with the respective years, to make it a bit more visible (you could skip this):

names(annual_means) <- yrs

Now I will extract the coordinates for each raster and create a dataframe with the values (using lapply to iterate over the list):

result <- lapply(annual_means, function(x) data.frame(long = coordinates(x)[,1],lat = coordinates(x)[,2],temp_mean =x[]))

Now we can inspect the top of the dataframe for e.g. year 2000:

head(result$`2000`)

#   long    lat  temp_mean
# 1 5.875 54.875       NaN
# 2 6.125 54.875       NaN
# 3 6.375 54.875       NaN
# 4 6.625 54.875       NaN
# 5 6.875 54.875       NaN
# 6 7.125 54.875       NaN

As you can see, the first pixels are all NoData (just like seen in the plot), which is what you want.

So in the end, result is a list with each element being a dataframe for a specific year, containing the columns long, lat and temp_mean.

To 100% replicate your desired output, one could now loop again over the result list to change the temp_mean column name to a year specific one (this is totally optional):

for (i in seq_along(result)){
  colnames(result[[i]])[3] <- paste0('Net_', names(result)[i], '_Temp')
}

Giving you:

head(result$`2000`)

#    long    lat  Net_2000_Temp
# 1 5.875 54.875            NaN
# 2 6.125 54.875            NaN
# 3 6.375 54.875            NaN
# 4 6.625 54.875            NaN
# 5 6.875 54.875            NaN
# 6 7.125 54.875            NaN

Edit3:

To obtain one dataframe with all the means, you can do this:

ann_mean_temp_1996_1999 <- cbind(result[[1]][,1:2],do.call(cbind,lapply(result,function(x) x[,3])))

colnames(ann_mean_temp_1996_1999)[3:ncol(ann_mean_temp_1996_1999)]<- unlist(lapply(result,function(x) colnames(x)[3]))

The first lapply binds the long/lat columns (which do not change across years) together with the 3rd column of every list item (which is the mean temperature).

The second lapply extracts and assigns the column names again for the temperatures, which seem to get lost in the process. There's probably a more elegant solution for this than using lapply two times, but it does the job.
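For what it's worth, a possibly more compact variant (assuming the named result list built above) is to bind the temperature columns directly and restore their year-specific names in the same step:

# keep long/lat from the first element and column-bind the 3rd column of
# every element; setNames() restores the Net_YYYY_Temp column names
ann_mean_temp_all <- cbind(
  result[[1]][, c('long', 'lat')],
  setNames(as.data.frame(lapply(result, function(x) x[[3]])),
           vapply(result, function(x) colnames(x)[3], character(1)))
)

head(ann_mean_temp_all)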