0
votes

I contain a dataset 'area'

House_No. Info_On_Area
1a        Names of neighbouringhouse in 100m  1b   1c    1d    1e 
1a        Area of neighbouringhouse  in 100m  500  1000  1500  300
1a        Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h
1a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000
2a        Names of neighbouringhouse in 100m  2b   2c    2d    2e 
2a        Area of neighbouringhouse  in 100m  500  1000  1500  300
2a        Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h
2a        Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000

I want to create a dataframe where I can have the table to appear as

House_No. Area of neighbouringhouse in 100m Area of neighbouringhouse  in 300m 

I used dplyr and grouped the different house numbers CT <- data.frame(but %>% group_by(House_No.)) and tried using rowSums. However, I got the error saying that the information is not numeric. I think that is because I need to make the numbers in the row values as numbers and I am not sure of how to do so. I got stuck at this stage and couldnt proceed further.

I did look into similar solutions but they dont seem to be having a data frame where they are working towards the summing row values such as Sum rows in data.frame or matrix, Sum by Rows in R.

I would be grateful for any help! Thank you :)

2
provide a sample of your data with dput(head(df, 10))AlexB

2 Answers

3
votes

Use stringr::str_extract_* to retreive the digits then do spread using pivot_wider

library(tidyverse)
df %>%  
   #extract everything up to 1+ digits followed by m
   mutate(flag = str_extract(Info_On_Area,'.*\\d+m'), 
          #extract any 1 or more digits followed by space or at the end
          SumArea = map_dbl(Info_On_Area, ~sum(as.numeric(str_extract_all(.x, '\\d+(?=\\s|$)', simplify = TRUE))))) %>% 
   filter(str_detect(Info_On_Area, 'Area')) %>% 
   #As suggested by @Uwe
   pivot_wider(id_cols = House_No., names_from = flag, values_from = SumArea)

# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <dbl>                                <dbl>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300

Data

df <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))
2
votes

The difficulty here is that the information is presented in a mixture of wide and long format. Info_On_Area is a character column which contains the variable name as well as an arbitrary number of values separated by white space. Therefore, Info_On_Area needs to be splitted in two steps. First, to extract the variable name and second to extract the numbers for subsequent conversion to numeric and summation.

Fortunately, the OP is only interested in the area information which simplifies matters.

1. tidyverse approach

library(dplyr)
library(purrr)
library(stringr)
library(tidyr)
area %>% 
  filter(Info_On_Area %>% str_detect("^Area")) %>% 
  separate(Info_On_Area, c("var", "val"), sep = "(?<=00m)") %>% 
  mutate(Area = map_int(val, ~ str_extract_all(. , "\\d+") %>% unlist() %>% as.integer() %>% sum())) %>%
  pivot_wider(id_cols = House_No., names_from = var, values_from = Area)
# A tibble: 2 x 3
  House_No. `Area of neighbouringhouse  in 100m` `Area of neighbouringhouse  in 300m`
  <chr>                                    <int>                                <int>
1 1a                                        3300                                 6300
2 2a                                        3300                                 6300

The result has one row for each House_No.. This is different to A. Suliman's solution which shows two rows for each House_No. (no longer in the edited version of A. Suliman's answer). Other differences include the use of the separate() and pivot_wider() functions, a regular expression with lookbehind "(?<=00m)", and to apply filter() as the first step in the pipeline.

2. data.table approach

For the sake of completeness, here is also a data.table solution:

library(data.table)
library(magrittr)
setDT(area)[Info_On_Area %like% "^Area", 
            c(.(House_No.= House_No.), tstrsplit(Info_On_Area, "(?<=00m)", perl = TRUE))][
              , str_extract_all(V3, "\\d+") %>% unlist() %>% as.integer() %>% sum(), by = .(House_No., V2)][
                , dcast(.SD, House_No. ~ V2, value.var = "V1")]
   House_No. Area of neighbouringhouse  in 100m Area of neighbouringhouse  in 300m
1:        1a                               3300                               6300
2:        2a                               3300                               6300

Data

area <- structure(list(House_No. = c("1a", "1a", "1a", "1a", "2a", "2a", 
"2a", "2a"), Info_On_Area = c("Names of neighbouringhouse in 100m  1b   1c    1d    1e", 
"Area of neighbouringhouse  in 100m  500  1000  1500  300", "Names of neighbouringhouse in 300m  1b   1c    1d    1e   1f    1g   1h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000", 
"Names of neighbouringhouse in 100m  2b   2c    2d    2e", "Area of neighbouringhouse  in 100m  500  1000  1500  300", 
"Names of neighbouringhouse in 300m  2b   2c    2d    2e   2f    2g   2h", 
"Area of neighbouringhouse  in 300m  500  1000  1500  300  600   400  2000"
)), class = "data.frame", row.names = c(NA, -8L))