0
votes

I have a number of points and would like to assign an ID based on the grid cell they intersect with. I know using the sf package you could use st_intersections or st_intersects to do this. However, for large datasets with millions of features/points this can take a long time (or just cause R to crash).

Having a grid system on a projected CRS (British National Grid in my example) means there is a consistent shape and the coordinates (xmin, ymin, xmax and ymax) will always provide the full extent for every cell. Is there a non-spatial method using something in tidyverse or data.table for example that can take advantage of this to assign grid IDs.

Here is my sample data (and if anyone can show me a cleaner way of extracting the xmin, ymin, xmax and ymax for each grid cell that would be appreciated as well):

library(sf)
library(dplyr)

BBox <- st_bbox(c(xmin = 0, xmax = 10000, ymax = 10000, ymin = 0), crs = st_crs(27700))

Grid  <- st_as_sfc(BBox) %>%
         st_make_grid(square = TRUE, cellsize = c(1e3, 1e3)) %>%
         cbind(data.frame(ID = sprintf(paste("GID%0",nchar(length(.)),"d",sep=""), 1:length(.)))) %>%
         st_sf()
             
Points <- st_sample(st_as_sfc(BBox), 3000, exact = TRUE) %>%
                    st_sf('ID' = seq(length(.)), 'geometry' = .) %>%
                    mutate(X = st_coordinates(.)[,1],
                    Y = st_coordinates(.)[,2])      
             
Table <- NULL
for(i in 1:nrow(Grid)) { 
Row <- cbind(as.numeric(st_bbox(Grid[i,])[1]),
    as.numeric(st_bbox(Grid[i,])[2]),
    as.numeric(st_bbox(Grid[i,])[3]),
    as.numeric(st_bbox(Grid[i,])[4]))
Table <- as.data.frame(rbind(Table, Row))
}
names(Table) <- c("xmin","ymin","xmax","ymax")

Grid <- cbind(Grid,Table)
1
I would suggest to stick with spatial approach. If you are against memory limits you can offload your data to a PostgreSQL database (it supports spatial data formats very nicely) and run the spatial join in database = transform the memory constraint to disk space constraint. It may take a while, but you are unlikely to run out of disk...Jindra Lacko

1 Answers

1
votes

I would (strongly) advise to go with a spatial approach like the 'default' sf::st_join(), where you can easily join ID's from the grid onto the points

reproducible sample data

library(sf)
library(dplyr)
set.seed(123)  #<<-- !!

BBox <- st_bbox(c(xmin = 0, xmax = 10000, ymax = 10000, ymin = 0), crs = st_crs(27700))

Grid  <- st_as_sfc(BBox) %>%
  st_make_grid(square = TRUE, cellsize = c(1e3, 1e3)) %>%
  cbind(data.frame(ID = sprintf(paste("GID%0",nchar(length(.)),"d",sep=""), 1:length(.)))) %>%
  st_sf()

Points <- st_sample(st_as_sfc(BBox), 3000, exact = TRUE) %>%
  st_sf('ID' = seq(length(.)), 'geometry' = .) %>%
  mutate(X = st_coordinates(.)[,1],
         Y = st_coordinates(.)[,2]) 

 

sf-solution

ans.sf <- Points %>% st_join( Grid, join = st_intersects )
# Simple feature collection with 6 features and 4 fields
# geometry type:  POINT
# dimension:      XY
# bbox:           xmin: 455.565 ymin: 1835.024 xmax: 9404.673 ymax: 9425.39
# projected CRS:  OSGB 1936 / British National Grid
#   ID.x        X        Y   ID.y                  geometry
# 1    1 2875.775 2058.269 GID023 POINT (2875.775 2058.269)
# 2    2 7883.051 9425.390 GID098  POINT (7883.051 9425.39)
# 3    3 4089.769 3793.238 GID035 POINT (4089.769 3793.238)
# 4    4 8830.174 6262.401 GID069 POINT (8830.174 6262.401)
# 5    5 9404.673 1835.024 GID020 POINT (9404.673 1835.024)
# 6    6  455.565 6592.076 GID061  POINT (455.565 6592.076)

But if you run into memory problemens, and are working with simple polygons (like a grid), the following might work (again, spatial solution is (imo) always preferred!!):

data.table solution non-spatial!!!!

library( data.table )
#easy for ppints, only one xY per row
DT.points <- as.data.table( st_set_geometry( Points, NULL ) )
#    ID        X        Y
# 1:  1 2875.775 2058.269
# 2:  2 7883.051 9425.390
# 3:  3 4089.769 3793.238
# 4:  4 8830.174 6262.401
# 5:  5 9404.673 1835.024
# 6:  6  455.565 6592.076
# ...

#a bit more complcated for polygons
Grid.coords <- Grid %>% 
  st_coordinates() %>%
  as.data.frame() %>%
  group_by( L2 ) %>%
  summarise( minx = min(X, na.rm = TRUE), 
             maxx = max(X, na.rm = TRUE),
             miny = min(Y, na.rm = TRUE),
             maxy = max(Y, na.rm = TRUE) ) 
DT.grid   <- cbind(
  as.data.table( st_set_geometry( Grid, NULL ) ),
  Grid.coords )
#        ID L2 minx maxx miny maxy
# 1: GID001  1    0 1000    0 1000
# 2: GID002  2 1000 2000    0 1000
# 3: GID003  3 2000 3000    0 1000
# 4: GID004  4 3000 4000    0 1000
# 5: GID005  5 4000 5000    0 1000
# 6: GID006  6 5000 6000    0 1000
# ...

#perform non-equi join (assuming a point only cvna fall into 1 polygon!)
DT.points[ DT.grid, Grid.id := i.ID, on = .(X >= minx, X < maxx, Y >= miny, Y < maxy)][]
#         ID         X        Y Grid.id
#    1:    1 2875.7752 2058.269  GID023
#    2:    2 7883.0514 9425.390  GID098
#    3:    3 4089.7692 3793.238  GID035
#    4:    4 8830.1740 6262.401  GID069
#    5:    5 9404.6728 1835.024  GID020
# ---                                
# 2996: 2996 1370.2189 8770.564  GID082
# 2997: 2997 8619.4776 8827.825  GID089
# 2998: 2998  671.2062 9011.131  GID091
# 2999: 2999 4352.0844 7979.424  GID075
# 3000: 3000 6801.3496 9014.823  GID097
#