5
votes

I have a grid of ocean depth data by location, and am trying to interpolate depth values for a selection of GPS points.

We've been using RSAGA::pick.from.points, which works fine for small data sets.

require(RSAGA)

depthdata <- cbind.data.frame(x=c(-74.136, -74.135, -74.134, -74.133, -74.132, -74.131, -74.130, -74.129, -74.128, -74.127), 
y=rep(40, times=10), 
depth=c(-0.6, -0.6, -0.9, -0.9, -0.9, -0.9, -0.9, -0.9, -0.6, -0.6))

mylocs <- rbind(c(-74.1325, 40), c(-74.1305, 40))
colnames(mylocs) <- c("x", "y")

results <- pick.from.points(data=mylocs, src=depthdata, pick=c("depth"), method="nearest.neighbour")
mydepths <- results$depth

But our depth data set contains 69 million points, we have 5 million GPS points that need depth estimates, and pick.from.points is simply too slow (> 2 weeks) at that scale. We suspect we could accomplish this task more quickly in MATLAB or ArcMap, but we're incorporating it into a longer R workflow that other people will run repeatedly, so switching to proprietary software for part of that workflow is less than desirable.

We'd be willing to sacrifice some degree of accuracy for speed.

I've searched for a solution as best I can, but I'm fairly new to grid data and interpolation, so I may be using inappropriate terminology and missing a simple solution.


1
I ran the code and I'm not exactly sure what it's doing, but maybe you could use akima::interp or fields::interp.surface for this - shadowtalker
Interesting question, but I think it might be too open-ended for StackOverflow. - Frank
I guess one approach might be to pass smaller, geographically contiguous sets of data (maybe blocks of lat-long ranges) to the interpolator in a parallel fashion (since presumably nearest.neighbour has a heckuva time finding neighbors for each point in your very large initial data set). Points near the edges of such a subset would be poorly interpolated, so your subsets would have to be overlapping, with only the "interior" points (by one mile, say?) interpolated within each. - Frank

1 Answer

6
votes

If you were willing to impute by finding the nearest neighbor and using its value, I think the trick would be to use an efficient nearest neighbors implementation that allows you to find the nearest neighbor among n alternatives in O(log(n)) time. The k-d tree provides this sort of performance, and is available through the FNN package in R. While the computation (on randomly generated data with 69 million data points for reference and 5 million data points to impute) isn't instantaneous (it takes about 3 minutes), it's much quicker than 2 weeks!

# Simulated stand-ins at the question's scale: 69 million reference points
# with one label (e.g. depth) each, and 5 million query points
data <- cbind(x=rnorm(6.9e7), y=rnorm(6.9e7))
labels <- rnorm(6.9e7)
query <- cbind(x=rnorm(5e6), y=rnorm(5e6))

library(FNN)
# k-d tree lookup of the single nearest neighbour (k=1) of each query point,
# returning that neighbour's label as the imputed value
get.nn <- function(data, labels, query) {
  nns <- get.knnx(data, query, k=1)
  labels[nns$nn.index]
}
system.time(get.nn(data, labels, query))
#    user  system elapsed
# 174.975   2.236 177.617
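Applied to the small example from the question, the same approach might look like the following (a sketch: I've written the grid's x coordinates as negative so they match the query longitudes, and note that get.knnx breaks ties between equidistant neighbours arbitrarily):

```r
library(FNN)

# The question's example grid, with x negated to match the query points
depthdata <- data.frame(x=c(-74.136, -74.135, -74.134, -74.133, -74.132,
                            -74.131, -74.130, -74.129, -74.128, -74.127),
                        y=rep(40, times=10),
                        depth=c(-0.6, -0.6, -0.9, -0.9, -0.9,
                                -0.9, -0.9, -0.9, -0.6, -0.6))
mylocs <- rbind(c(-74.1325, 40), c(-74.1305, 40))

# Nearest neighbour in (x, y) space; its depth is the imputed value
nns <- get.knnx(depthdata[, c("x", "y")], mylocs, k=1)
mydepths <- depthdata$depth[nns$nn.index]
mydepths
```

Both query points land in the middle of the grid, where every nearby cell has depth -0.9, so both imputed depths come out as -0.9.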

As a warning, memory use peaked at around 10 GB during this computation, so you will need significant memory resources to run it on a dataset of your size.
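If memory is the constraint, one option (a sketch, untested at your scale) is to process the query points in blocks, along the lines of Frank's comment, so only one block's results are held at a time. FUN can be the FNN-based get.nn from above; a brute-force stand-in is used here to keep the sketch self-contained. Note that get.knnx rebuilds its search tree on each call, so very small blocks would be slow.

```r
# Brute-force nearest neighbour, used only as a self-contained stand-in for FUN
nn.bruteforce <- function(data, labels, query) {
  apply(query, 1, function(q)
    labels[which.min((data[, 1] - q[1])^2 + (data[, 2] - q[2])^2)])
}

# Split the query rows into blocks of chunk.size and impute each block
# independently, concatenating the results in the original row order
get.nn.chunked <- function(data, labels, query, chunk.size = 1e6,
                           FUN = nn.bruteforce) {
  idx <- split(seq_len(nrow(query)), ceiling(seq_len(nrow(query)) / chunk.size))
  unname(unlist(lapply(idx, function(i)
    FUN(data, labels, query[i, , drop = FALSE]))))
}

set.seed(1)
data <- cbind(x=rnorm(100), y=rnorm(100))
labels <- rnorm(100)
query <- cbind(x=rnorm(13), y=rnorm(13))
res <- get.nn.chunked(data, labels, query, chunk.size = 5)
```

Because the blocks are independent, lapply could also be swapped for parallel::mclapply to process them concurrently.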