Grouping by statistical significance in R

Question

I am trying to figure out the best way to group my data into factors to use for creating an MDS plot. I have annual discharge data that I would like to break into "high", "medium" and "low" flow years (or just high and low if a medium can't be determined). My issue is that I don't know the best way to do this. I can look at plots and pick the high, med, low, but I want a statistically sound way of doing this rather than just eyeballing it.

I have this data

structure(list(Zone = c("B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B"), Year = c(2008L, 2009L, 2010L, 2011L, 2012L, 
2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 
2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 
2018L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 
2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L, 
2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 2010L, 2011L, 
2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2008L, 2009L, 
2010L, 2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L
), Month = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 
10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 
11L, 11L, 11L, 11L, 11L), Discharge = c(10, 77.8, 273.6, 29.3, 
1.5, 86.1, 483.4, 84.8, 53.4, 50.3, 131.3, 3.79, 67.8, 97.1, 
10.7, 144.4, 62.6, 171.6, 29.2, 36.2, 179.6, 383.5, 2.39, 32.4, 
75.2, 7.85, 436.1, 630.1, 57.2, 27.7, 25.8, 73.7, 314.5, 175.7, 
22.8, 68, 4.82, 329.9, 573.8, 34.5, 55.1, 40.1, 150.3, 416.7, 
267.3, 37, 65.3, 5.41, 533.9, 255.2, 69.3, 36.8, 158.8, 95.7, 
271.1, 45.3, 19.9, 74.3, 2.15, 222.2, 59.5, 36, 29.5, 38.7, 56.4, 
65.2, 28.4, 17.2, 19.5, 1.31, 49, 30.8, 29.6, 39.5, 19.5, 35.8, 
123.8), Date = c("2008-05", "2009-05", "2010-05", "2011-05", 
"2012-05", "2013-05", "2014-05", "2015-05", "2016-05", "2017-05", 
"2018-05", "2008-06", "2009-06", "2010-06", "2011-06", "2012-06", 
"2013-06", "2014-06", "2015-06", "2016-06", "2017-06", "2018-06", 
"2008-07", "2009-07", "2010-07", "2011-07", "2012-07", "2013-07", 
"2014-07", "2015-07", "2016-07", "2017-07", "2018-07", "2008-08", 
"2009-08", "2010-08", "2011-08", "2012-08", "2013-08", "2014-08", 
"2015-08", "2016-08", "2017-08", "2018-08", "2008-09", "2009-09", 
"2010-09", "2011-09", "2012-09", "2013-09", "2014-09", "2015-09", 
"2016-09", "2017-09", "2018-09", "2008-10", "2009-10", "2010-10", 
"2011-10", "2012-10", "2013-10", "2014-10", "2015-10", "2016-10", 
"2017-10", "2018-10", "2008-11", "2009-11", "2010-11", "2011-11", 
"2012-11", "2013-11", "2014-11", "2015-11", "2016-11", "2017-11", 
"2018-11")), row.names = c(228L, 229L, 230L, 231L, 232L, 233L, 
234L, 235L, 236L, 237L, 238L, 242L, 243L, 244L, 245L, 246L, 247L, 
248L, 249L, 250L, 251L, 252L, 256L, 257L, 258L, 259L, 260L, 261L, 
262L, 263L, 264L, 265L, 266L, 270L, 271L, 272L, 273L, 274L, 275L, 
276L, 277L, 278L, 279L, 280L, 284L, 285L, 286L, 287L, 288L, 289L, 
290L, 291L, 292L, 293L, 294L, 298L, 299L, 300L, 301L, 302L, 303L, 
304L, 305L, 306L, 307L, 308L, 312L, 313L, 314L, 315L, 316L, 317L, 
318L, 319L, 320L, 321L, 322L), class = "data.frame")

I ran a pairwise comparison using Wilcoxon rank sum test to determine significant differences between discharge rates for each year and got this:

pairwise.wilcox.test( B_Dis_data$Discharge, B_Dis_data$Year, p.adjust.method = "BH")
    Pairwise comparisons using Wilcoxon rank sum test 

data:  B_Dis_data$Discharge and B_Dis_data$Year 

     2008  2009  2010  2011  2012  2013  2014  2015  2016  2017 
2009 0.868 -     -     -     -     -     -     -     -     -    
2010 0.328 0.191 -     -     -     -     -     -     -     -    
2011 0.283 0.020 0.011 -     -     -     -     -     -     -    
2012 0.235 0.133 0.366 0.069 -     -     -     -     -     -    
2013 0.167 0.110 0.781 0.008 1.000 -     -     -     -     -    
2014 0.366 0.283 0.710 0.008 0.437 0.366 -     -     -     -    
2015 0.710 0.654 0.191 0.014 0.133 0.069 0.328 -     -     -    
2016 0.781 0.654 0.208 0.014 0.167 0.110 0.514 1.000 -     -    
2017 0.328 0.110 0.954 0.008 0.366 0.583 0.710 0.110 0.191 -    
2018 0.069 0.014 0.133 0.008 1.000 0.583 0.191 0.011 0.020 0.110

P value adjustment method: BH

Any advice would be greatly appreciated!!

This seems like more of a stats/hydrology question rather than an R question. Perhaps you could sum discharge over the course of a year and then create categories using cut()? — John Harley
@John Harley I'm not familiar with the cut() function.. can you provide an example of how this would work? — user13314714
This doesn't appear to be a specific programming question that's appropriate for Stack Overflow. If you have general questions about the appropriate use of various statistical methods, then you should ask such questions over at Cross Validated instead. You are more likely to get better answers there. — MrFlick
I was afraid that might be the case. I asked this question over there as well but have not received any feedback yet.. — user13314714

jay.sf jay.sf · Accepted Answer · 2020-06-30T06:28:57

Referring to your question about cut:

You could try to average to year means and accordingly cut the data into the three categories.

B_Dis_data <- within(B_Dis_data, {
  Discharge.M <- ave(Discharge, Year, FUN=mean)
  Discharge.C.eq <- cut(Discharge.M, breaks=3, c("low", "medium", "high"))
  Discharge.C.qu <- cut(B_Dis_data$Discharge.M, 
                        quantile(Discharge.M, seq(0, 1, length.out=4)), 
                        c("low", "medium", "high"), include.lowest=TRUE)
})

Either cut them into three equal sized intervals using breaks=3 (Discharge.C.eq) or use the quantiles 1/3 and 2/3 as breakpoints (Discharge.C.qu) (which might be better).

What is happening within cut:

Equal sized intervals:

levels(cut(B_Dis_data$Discharge.M, 3))
# [1] "(8.55,87.6]" "(87.6,166]"  "(166,246]" 
## same as: 
(B_Dis.rg <- range(B_Dis_data$Discharge.M))
# [1]   8.791429 245.285714
c(B_Dis.rg[1], B_Dis.rg[1] + (B_Dis.rg[2] - B_Dis.rg[1])/3*c(1:3))
# [1]   8.791429  87.622857 166.454286 245.285714

Quantiles (terciles):

quantile(B_Dis_data$Discharge.M, seq(0, 1, length.out=4))
#       0%  33.33333%  66.66667%       100% 
# 8.791429  53.214286 125.942857 245.285714

Grouping by statistical significance in R

1 Answers