1
votes

I am developing a workflow script for processing sf objects in R. sf is the simple features class, which provides a way to work with spatial data in the tidyverse. However, I am having crippling difficulties running a standard group_by() %>% summarize() %>% mutate() pipeline on data stored as sf: the pipeline works after the object is converted to a data frame, but not on the sf object itself.

Essentially, I am trying to group lower-level geographies by higher-level geographies and output summary variables. I then need to mutate a variable in the new summarized sf object that computes a sum across multiple variables and divides by another variable. With sf objects this last operation throws the error "x 'x' must be numeric", but the identical operation works on a data frame of the same data (just sans geography). I have verified that every variable passed to rowSums() is numeric.

Full reprex below. In the first example, the operation fails on the sf version of the sample data. In the second example, with as.data.frame() called before separate(), the process succeeds, but this eliminates the geographies, which are crucial for my analysis.

Thanks, all!

library(sf)
#> Warning: package 'sf' was built under R version 4.0.2
#> Linking to GEOS 3.8.1, GDAL 3.1.1, PROJ 6.3.1
library(tidyverse)
#> Warning: package 'ggplot2' was built under R version 4.0.2
#> Warning: package 'tibble' was built under R version 4.0.2
#> Warning: package 'tidyr' was built under R version 4.0.2
#> Warning: package 'dplyr' was built under R version 4.0.2
library(dplyr)
library(spdep)
#> Loading required package: sp
#> Loading required package: spData
#> To access larger datasets in this package, install the spDataLarge
#> package with: `install.packages('spDataLarge',
#> repos='https://nowosad.github.io/drat/', type='source')`
library(stringi)
#> Warning: package 'stringi' was built under R version 4.0.2

nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)

names(nc)
#>  [1] "CNTY_ID"   "AREA"      "PERIMETER" "CNTY_"     "NAME"      "FIPS"     
#>  [7] "FIPSNO"    "CRESS_ID"  "BIR74"     "SID74"     "NWBIR74"   "BIR79"    
#> [13] "SID79"     "NWBIR79"   "east"      "north"     "x"         "y"        
#> [19] "lon"       "lat"       "L_id"      "M_id"      "geometry"

nc %>% 
  separate(CNTY_ID,into = c("ID1","ID2"),sep = 2,remove = FALSE) %>% 
  group_by(ID1) %>% 
  dplyr::summarize(AREA = sum(AREA, na.rm = TRUE), 
                   BIR74 = sum(BIR74,na.rm = TRUE), 
                   SID74 = sum(SID74,na.rm = TRUE), 
                   NWBIR74 = sum(NWBIR74,na.rm = TRUE)
                   ) %>% 
  mutate(stupid_var = rowSums(dplyr::select(.,'SID74':'NWBIR74'))/BIR74)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> Error: Problem with `mutate()` input `stupid_var`.
#> x 'x' must be numeric
#> ℹ Input `stupid_var` is `rowSums(dplyr::select(., "SID74":"NWBIR74"))/BIR74`.

class(nc$SID74)
#> [1] "numeric"
class(nc$NWBIR74)
#> [1] "numeric"
class(nc$BIR74)
#> [1] "numeric"

nc %>% 
  as.data.frame() %>% 
  separate(CNTY_ID,into = c("ID1","ID2"),sep = 2,remove = FALSE) %>% 
  group_by(ID1) %>% 
  dplyr::summarize(AREA = sum(AREA, na.rm = TRUE), 
                   BIR74 = sum(BIR74,na.rm = TRUE), 
                   SID74 = sum(SID74,na.rm = TRUE), 
                   NWBIR74 = sum(NWBIR74,na.rm = TRUE)
  ) %>% 
  mutate(stupid_var = rowSums(dplyr::select(.,'SID74':'NWBIR74'))/BIR74)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 6
#>   ID1    AREA  BIR74 SID74 NWBIR74 stupid_var
#>   <chr> <dbl>  <dbl> <dbl>   <dbl>      <dbl>
#> 1 18    2.53   36723    89   12788      0.351
#> 2 19    4.03  132525   203   38392      0.291
#> 3 20    3.94  111540   237   35281      0.318
#> 4 21    1.63   38117   106   14915      0.394
#> 5 22    0.494  11057    32    3723      0.340

Created on 2020-09-21 by the reprex package (v0.3.0)

2 Answers

1
votes

I made a change to the following line of code.

mutate(stupid_var = rowSums(dplyr::select(.,'SID74':'NWBIR74'))/BIR74)

This line was probably causing the issue: dplyr::select() on an sf object keeps the geometry column, so rowSums() receives a column that is not numeric. Unless I am missing something, there is no need for rowSums() here, because the same value can be computed directly from the columns of each row. So I removed rowSums() and let mutate() do the arithmetic row by row.

p1 <- nc %>% 
  separate(CNTY_ID, into = c("ID1", "ID2"), sep = 2, remove = FALSE) %>% 
  group_by(ID1) %>% 
  dplyr::summarize(AREA = sum(AREA, na.rm = TRUE), 
                   BIR74 = sum(BIR74, na.rm = TRUE), 
                   SID74 = sum(SID74, na.rm = TRUE), 
                   NWBIR74 = sum(NWBIR74, na.rm = TRUE)) %>%
  # reference the summarized columns directly; no rowSums() needed
  mutate(stupid_var = (SID74 + NWBIR74) / BIR74)
p1

The output can be viewed from this link.
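
If rowSums() is preferred (for example, when many columns need to be added), a minimal sketch of an alternative is to drop the geometry only inside the rowSums() call, so the result is still an sf object. The toy data, the column names (grp, births, sid, nw), and the values below are made up for illustration and are not from the question.

```r
library(sf)
library(dplyr)

# Hypothetical toy data: three points in two groups (values are invented).
toy <- st_sf(
  grp    = c("a", "a", "b"),
  births = c(100, 200, 300),
  sid    = c(1, 2, 6),
  nw     = c(10, 20, 30),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

out <- toy %>%
  group_by(grp) %>%
  summarize(births = sum(births), sid = sum(sid), nw = sum(nw)) %>%
  # select() on an sf object keeps the geometry, so drop it before rowSums()
  mutate(ratio = rowSums(st_drop_geometry(select(., sid:nw))) / births)

out
```

Only the temporary copy passed to rowSums() loses its geometry; the piped object itself is left as sf.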

0
votes

There is probably some reason why CNTY_ID was split into two variables, but you haven't said what it is. I made the split in the first answer, but I am not using the split variables here.

Whenever data includes an sf geometry column, that geometry is sticky and follows the data, even when the data is subsetted. While the geometry is present, it causes problems for functions that operate across whole rows or columns of the object, such as rowSums(). So the geometry must be removed before such a function is used.
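
To illustrate the sticky geometry, here is a small sketch on a hand-built sf object. The object toy and its columns a and b are hypothetical stand-ins, not part of the nc data.

```r
library(sf)

# Hypothetical toy sf object with two numeric columns (values are invented).
toy <- st_sf(
  a = c(1, 2, 3),
  b = c(10, 20, 30),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# Subsetting by column still carries the geometry column along.
kept <- names(toy[, c("a", "b")])
kept

# rowSums() on the sf object is expected to fail, as in the question.
failed <- inherits(try(rowSums(toy[, c("a", "b")]), silent = TRUE), "try-error")
failed

# After dropping the geometry, rowSums() sees only numeric columns.
sums <- rowSums(st_drop_geometry(toy)[, c("a", "b")])
sums
```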

In this second answer, I used two of the variables from the first answer. The nc data is subsetted to columns 9 and 10 (BIR74 and SID74); that was my choice, because there is no guidance about which columns get added together. Then the sf geometry is dropped, and rowSums() adds the values from each column for every row.

gr_1 <- nc[, 9:10]                 # columns BIR74 and SID74; geometry is still attached
gr_1 <- st_drop_geometry(gr_1)     # drop the sticky geometry column
rownames(gr_1) <- NULL             # reset the row names of gr_1

xsum <- rowSums(gr_1)
head(xsum)                         # displays the first values of xsum

The output can be viewed at this link:
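
Since the question needs to keep the geometries, one possible follow-up is to compute the row sums on a geometry-free copy and then attach the result back to the original sf object as a new column. This is only a sketch: the toy object and the column name xsum are assumptions for illustration.

```r
library(sf)

# Hypothetical toy sf object standing in for the nc data (values are invented).
toy <- st_sf(
  births = c(100, 200, 300),
  sid    = c(1, 2, 3),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)), st_point(c(2, 2)))
)

# Work on a geometry-free copy of the two numeric columns ...
vals <- st_drop_geometry(toy)[, c("births", "sid")]

# ... and store the row sums back on the sf object, which keeps its geometry.
toy$xsum <- rowSums(vals)
toy
```

Row order is unchanged by st_drop_geometry(), so the sums line up with the original rows.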