Subtract rows based on many columns

Question

I have this data.frame

DATA

df <- data.frame(id=c(rep("site1", 3), rep("site2", 8), rep("site3", 9), rep("site4", 15)),
                 major_rock = c("greywacke",    "mudstone", "gravel",   "greywacke",    "gravel",   "mudstone", "gravel", "mudstone", "mudstone",   
                                "conglomerate", "gravel", "mudstone",   "greywacke","conglomerate", "gravel",   "gravel",   "greywacke","gravel",   
                                "greywacke",    "gravel",   "mudstone", "greywacke",    "gravel", "gravel", "gravel",   "conglomerate", "greywacke",
                                "coquina",  "gravel",   "gravel",   "greywacke",    "gravel",   "mudstone","mudstone",  "gravel"),
                 minor_rock = c("sandstone mudstone basalt chert limestone",  "limestone",   "sand silt clay", "sandstone mudstone basalt chert limestone",
                                "sand silt clay", "sandstone conglomerate coquina tephra", NA, "limestone",  "mudstone sandstone coquina limestone",
                                "sandstone mudstone limestone",  "sand loess silt",  "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sandstone mudstone limestone", "sand loess silt", "loess silt sand", "sandstone mudstone conglomerate chert limestone basalt",
                                "sand silt clay",  "sandstone mudstone conglomerate", "loess sand silt", "sandstone conglomerate coquina tephra", "sandstone mudstone basalt chert limestone",
                                "sand loess silt", "sand silt clay", "loess silt sand",  "sandstone mudstone limestone", "sandstone mudstone conglomerate chert limestone basalt",
                                "limestone", "loess sand silt",  NA, "sandstone mudstone conglomerate", "sandstone siltstone mudstone limestone silt lignite", "limestone",
                                "mudstone sandstone coquina limestone", "mudstone tephra loess"),
                 area_ha = c(1066.68,   7.59,   3.41,   4434.76,    393.16, 361.69, 306.75, 124.93, 95.84,  9.3,    8.45,   4565.89,    2600.44,    2198.52,    
                             2131.71,   2050.09,    1640.47,    657.09, 296.73, 178.12, 10403.53,   8389.2,  8304.08,   3853.36,    2476.36,    2451.25,    
                             1640.47,   1023.02,    532.94, 385.68, 296.73, 132.45, 124.93, 109.12, 4.87))

In it, there are 4 sites, 2 of them are independent (site1 and site3 ; they don't include any site upstream) and 2 are dependent (site2 and site4; they include upstream site(s))

I want to create a new data.frame, let's call it df_indep. In which, I want all sites to be independent which means subtracting any upstream site(s) from the dependent sites as below

site1 and site3 will remain the same as they are independent

site2 (independent) = site2 - site1

site4 (independent) = site4 -(site2+site3)

Below is df only for major_rock and minor_rock combinations that have area_percent larger than 15% (before subtracting upstream sites; site2 and site3)

library(dplyr)
head(df %>% group_by(id) %>% 
       mutate(area_percent = area_ha/sum(area_ha)*100) %>% 
       filter(area_percent>5),15)


#       id   major_rock                                             minor_rock  area_ha area_percent
#   <fctr>       <fctr>                                                 <fctr>    <dbl>        <dbl>
#1   site1    greywacke              sandstone mudstone basalt chert limestone  1066.68    98.979289
#2   site2    greywacke              sandstone mudstone basalt chert limestone  4434.76    77.329604
#3   site2       gravel                                         sand silt clay   393.16     6.855592
#4   site2     mudstone                  sandstone conglomerate coquina tephra   361.69     6.306845
#5   site2       gravel                                                     NA   306.75     5.348848
#6   site3     mudstone                  sandstone conglomerate coquina tephra  4565.89    27.978879
#7   site3    greywacke              sandstone mudstone basalt chert limestone  2600.44    15.934986
#8   site3 conglomerate                           sandstone mudstone limestone  2198.52    13.472099
#9   site3       gravel                                        sand loess silt  2131.71    13.062701
#10  site3       gravel                                        loess silt sand  2050.09    12.562550
#11  site3    greywacke sandstone mudstone conglomerate chert limestone basalt  1640.47    10.052479
#12  site4     mudstone                  sandstone conglomerate coquina tephra 10403.53    25.925869
#13  site4    greywacke              sandstone mudstone basalt chert limestone  8389.20    20.906106
#14  site4       gravel                                        sand loess silt  8304.08    20.693984
#15  site4       gravel                                         sand silt clay  3853.36     9.602674

and here is the

FINAL RESULT

I want after subtracting upstream sites

#       id   major_rock                                             minor_rock area_ha area_percent
#1   site1    greywacke              sandstone mudstone basalt chert limestone 1066.68    98.979289
#2   site2    greywacke              sandstone mudstone basalt chert limestone 3368.08    72.319849
#3   site2       gravel                                         sand silt clay  389.75     8.368762
#4   site2     mudstone                  sandstone conglomerate coquina tephra  361.69     7.766254
#5   site2       gravel                                                     NA  306.75     6.586576
#6   site3     mudstone                  sandstone conglomerate coquina tephra 4565.89    27.978879
#7   site3    greywacke              sandstone mudstone basalt chert limestone 2600.44    15.934986
#8   site3 conglomerate                           sandstone mudstone limestone 2198.52    13.472099
#9   site3       gravel                                        sand loess silt 2131.71    13.062701
#10  site3       gravel                                        loess silt sand 2050.09    12.562550
#11  site3    greywacke sandstone mudstone conglomerate chert limestone basalt 1640.47    10.052479
#12  site4     mudstone                  sandstone conglomerate coquina tephra 5475.95    30.297305
#13  site4    greywacke              sandstone mudstone basalt chert limestone 1354.00     7.491403
#14  site4       gravel                                        sand loess silt 6163.92    34.103701
#15  site4       gravel                                         sand silt clay 2803.11    15.509031

I will appreciate any suggestions on how to do this in R.

UPDATE

This is a map showing all 4 sites

The figure below shows site4 (accumulative as in df) and site1 (independent) as I want in the final output after subtracting site2 and site3

The figure below shows the same for site2 (accumulative) and indepenendent

Regarding @rbierman's question of how sites dependcies are encoded, please check below.

#      id dependent dep_site1 dep_site2 dep_site3
#1  site1        no        no        no        no
#2  site1        no        no        no        no
#3  site1        no        no        no        no
#4  site2       yes       yes        no        no
#5  site2       yes       yes        no        no
#6  site2       yes       yes        no        no
#7  site2       yes       yes        no        no
#8  site2       yes       yes        no        no
#9  site2       yes       yes        no        no
#10 site2       yes       yes        no        no
#11 site2       yes       yes        no        no
#12 site3        no        no        no        no
#13 site3        no        no        no        no
#14 site3        no        no        no        no
#15 site3        no        no        no        no
#16 site3        no        no        no        no
#17 site3        no        no        no        no
#18 site3        no        no        no        no
#19 site3        no        no        no        no
#20 site3        no        no        no        no
#21 site4       yes       yes       yes       yes
#22 site4       yes       yes       yes       yes
#23 site4       yes       yes       yes       yes
#24 site4       yes       yes       yes       yes
#25 site4       yes       yes       yes       yes
#26 site4       yes       yes       yes       yes
#27 site4       yes       yes       yes       yes
#28 site4       yes       yes       yes       yes
#29 site4       yes       yes       yes       yes
#30 site4       yes       yes       yes       yes
#31 site4       yes       yes       yes       yes
#32 site4       yes       yes       yes       yes
#33 site4       yes       yes       yes       yes
#34 site4       yes       yes       yes       yes
#35 site4       yes       yes       yes       yes

If you already have the final result that you want, what is your question exactly? — Samuel
@Samuel I got the final result for 4 sites ONLY using Microsoft Excel. My data include around 150 sites. I just presented 4 sites in this reproducible example to help anyone who will answer the question get an idea about the output I want. So, my question is "how can I get the final result (that I got through Excel) using R?" — shiny
You mention subtracting, what exactly entails this operation regarding your data? Could you explain a little. I am still confused as to what you need help with. — Samuel
how are the site dependencies encoded? If you gave me all 150 sites how would I know if site 76 is independent or not? — mitoRibo

Gregor Thomas Gregor Thomas · Accepted Answer · 2017-03-01T23:34:42

This isn't too bad, just a little renaming and joining.

First we'll want dependencies in a nice two-column format. You can use reshape2::melt or tidyr::gather on the wide dependencies you posted to make them long:

deps = data.frame(
    id = c("site2", "site4", "site4"),
    dependency = c("site1", "site2", "site3"),
    stringsAsFactors = FALSE
)
#      id dependency
# 1 site2      site1
# 2 site4      site2
# 3 site4      site3

Using dplyr for joins, we'll also want character rather than factor columns just in case the levels aren't all the same.

    library(dplyr)    
df = mutate_at(df, .cols = c("id", "major_rock", "minor_rock"), .funs = funs(as.character))

First we create a "dependency with measure" data frame that has a clearly dependent name for the area and id (edit) and then we aggregate it to the id level, summing the dependent areas:

dep_w_measure = df %>%
    select(dependency = id, major_rock, minor_rock, dep_area = area_ha) %>%
    inner_join(deps) %>%
    group_by(id, major_rock, minor_rock) %>%
    summarize(dep_area = sum(dep_area))

Then we join that to the original data, and subtract off the dependent area (where present):

result = left_join(df, dep_w_measure, by = c("major_rock", "minor_rock", "id")) %>%
    mutate(area_ind = area_ha - coalesce(dep_area, 0))
head(result)
#      id major_rock                                minor_rock area_ha dep_area area_ind
# 1 site1  greywacke sandstone mudstone basalt chert limestone 1066.68       NA  1066.68
# 2 site1   mudstone                                 limestone    7.59       NA     7.59
# 3 site1     gravel                            sand silt clay    3.41       NA     3.41
# 4 site2  greywacke sandstone mudstone basalt chert limestone 4434.76  1066.68  3368.08
# 5 site2     gravel                            sand silt clay  393.16     3.41   389.75
# 6 site2   mudstone     sandstone conglomerate coquina tephra  361.69       NA   361.69

I left the dep_area and area_ha columns in to "show my work", you can clean it up as needed. The independent area area_ind column matches the area_ha in your desired output.

Subtract rows based on many columns

1 Answers