I hope someone can provide some guidance. I have a dataset from a population that was tested for an infection across three years. Some individuals, but not all, were sampled in more than one year, so they represent repeated measures. I want to determine whether the prevalence of the infection is changing over time, but I am having trouble determining the appropriate test. A simple contingency test violates the assumption of independence because some individuals are repeated across years. I don't think the Cochran-Mantel-Haenszel test or McNemar's chi-square test is appropriate, but feel free to correct me if I am wrong. Below is the data set I am working with. The "AnID" variable is a factor that identifies a single individual, so if an individual was sampled in multiple years, that ID appears 2 or 3 times.

I think a viable option would be to randomly re-sample the data many times (without replacement), each time including each individual only once, and perform a contingency test across years. If the null hypothesis of no difference is rejected at least 95% of the time, then I could reliably claim that there is a difference. I am not good enough with R yet to write my own code for this. Thanks in advance for any help you can offer.

dput(example) structure(list(AnID = structure(c(37L, 37L, 45L, 45L, 45L, 55L, 55L, 62L, 62L, 68L, 68L, 1L, 1L, 2L, 3L, 3L, 4L, 9L, 9L, 18L, 18L, 18L, 19L, 19L, 19L, 20L, 20L, 21L, 22L, 22L, 23L, 24L, 24L, 24L, 25L, 25L, 25L, 26L, 27L, 28L, 28L, 28L, 29L, 29L, 29L, 30L, 31L, 32L, 32L, 33L, 34L, 35L, 36L, 38L, 38L, 39L, 39L, 40L, 41L, 41L, 42L, 42L, 42L, 43L, 43L, 43L, 44L, 46L, 46L, 46L, 47L, 47L, 47L, 48L, 48L, 48L, 49L, 49L, 49L, 50L, 51L, 52L, 52L, 53L, 53L, 54L, 54L, 56L, 56L, 57L, 57L, 57L, 58L, 59L, 60L, 61L, 63L, 64L, 65L, 66L, 67L, 69L, 70L, 71L, 72L, 73L, 74L, 74L, 5L, 6L, 7L, 8L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L), .Label = c("10", "11", "12", "13", "136", "137", "138", "139", "14", "140", "141", "142", "143", "144", "145", "146", "147", "26", "27", "28", "29", "30", "31", "37", "38", "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", "5", "50", "51", "52", "53", "57", "58", "59", "6", "60", "61", "62", "63", "64", "65", "66", "67", "69", "7", "70", "71", "72", "75", "76", "77", "8", "82", "83", "84", "85", "86", "9", "90", "94", "95", "96", "97", "98"), class = "factor"), year = structure(c(1L, 2L, 1L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 3L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("2012", "2013", "2014"), class = "factor"), value = c("Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg", "Pos", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", 
"Neg", "Neg", "Pos", "Pos", "Neg", "Pos", "Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Pos", "Pos", "Neg", "Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Neg", "Pos", "Pos", "Pos", "Neg", "Neg", "Pos", "Neg", "Pos", "Neg")), .Names = c("AnID", "year", "value"), row.names = 187:306, class = "data.frame")

1 Answer

Keep in mind that a study design should include a sample size (power) calculation in advance, so that you maximize the chance of detecting a statistically significant difference if one exists. (For more info see https://en.wikipedia.org/wiki/Sample_size_determination and https://en.wikipedia.org/wiki/Statistical_power.)
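As a rough illustration of such a calculation, base R's power.prop.test() can estimate the per-group sample size needed to detect a difference between two proportions. The planning values below (25% vs 50% prevalence, 80% power, 5% significance level) are hypothetical assumptions, not taken from your data, and the function assumes two independent groups of equal size.

```r
# Hypothetical planning values (assumptions, not estimates from the question's data)
p_baseline <- 0.25   # assumed prevalence in the first year
p_later    <- 0.50   # assumed prevalence in a later year

# Solve for n (per group) given the two proportions, power and significance level
res <- power.prop.test(p1 = p_baseline, p2 = p_later,
                       power = 0.80, sig.level = 0.05)

# Round up, since you cannot sample a fraction of an individual
n_per_group <- ceiling(res$n)
n_per_group
```

This only covers a two-group comparison; a three-year design with partly repeated individuals would need a more careful calculation.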

If all your users were before/after subjects (like test/control), you could have used McNemar's test to compare paired proportions (see https://en.wikipedia.org/wiki/McNemar's_test).
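A minimal sketch of what that would look like with base R's mcnemar.test(), assuming every individual had been tested in exactly two years. The counts in the 2 x 2 table are made up for illustration only.

```r
# Hypothetical paired counts: rows = status in the first year,
# columns = status of the same individual in the second year
paired <- matrix(c(20,  5,
                   12, 15),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(before = c("Neg", "Pos"),
                                 after  = c("Neg", "Pos")))

# McNemar's test only uses the discordant cells (Neg -> Pos and Pos -> Neg)
mc <- mcnemar.test(paired)
mc$p.value
```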

However, not all users have repeated measurements, so I'd choose to randomly pick one year for each user, which gives 3 independent samples of values.

Assume that dt is your dataset:

library(dplyr)

set.seed(1)   # fix the seed so the random sampling is reproducible

dt %>%                      
  mutate(Pos = ifelse(value == "Pos", 1, 0)) %>%   # create a binary variable to flag positives
  group_by(AnID) %>%                               # for each user
  sample_n(1) %>%                                  # get one row/value randomly
  group_by(year) %>%                               # for each year
  summarise(N = n(),                               # get number of users
            N_Pos = sum(Pos),                      # get number of positive users
            Prc_Pos = mean(Pos)) %>%               # get proportion of positives
  print() -> tbl1                                  # print and save it

# # A tibble: 3 × 4
#     year     N N_Pos   Prc_Pos
#   <fctr> <int> <dbl>     <dbl>
# 1   2012    23     6 0.2608696
# 2   2013    27     9 0.3333333
# 3   2014    24    13 0.5416667

After inspecting the proportions above for each year, you can run a test for equality of proportions:

# run the statistical comparison of proportions
prop.test(tbl1$N_Pos, tbl1$N)

# 3-sample test for equality of proportions without continuity correction
# 
# data:  tbl1$N_Pos out of tbl1$N
# X-squared = 4.3038, df = 2, p-value = 0.1163
# alternative hypothesis: two.sided
# sample estimates:
#    prop 1    prop 2    prop 3 
# 0.2608696 0.3333333 0.5416667 

The p-value here (0.1163) is above the usual 0.05 threshold, so this particular random sample gives no statistically significant evidence of a difference between the years in the probability of being positive.

If you do find a difference, you can run pairwise comparisons between the years.

# run pairwise comparisons 
pairwise.prop.test(tbl1$N_Pos, tbl1$N)

# Pairwise comparisons using Pairwise comparison of proportions 
# 
# data:  tbl1$N_Pos out of tbl1$N 
# 
# 1    2   
# 2 0.80 -   
# 3 0.29 0.45
# 
# P value adjustment method: holm 

The output here is 3 p-values (one for each of the 3 pairwise comparisons). As expected, none of them suggests a difference between the years.

You can wrap the above process in a function and run it N times, each with a different random draw. Then check in how many of those simulations you find statistically significant results.
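One possible way to do that is sketched below. It is written in base R rather than dplyr so that it runs on its own, and it builds a small made-up data frame with the same columns (AnID, year, value) purely so the example is self-contained. The names one_run and n_sims are hypothetical; swap in your real dt.

```r
set.seed(1)

# Hypothetical toy data with the same columns as dt; replace with your real data
n_ids  <- 60
visits <- sample(1:3, n_ids, replace = TRUE)   # years sampled per individual
dt <- do.call(rbind, lapply(seq_len(n_ids), function(i) {
  data.frame(AnID  = i,
             year  = sort(sample(c("2012", "2013", "2014"), visits[i])),
             value = sample(c("Pos", "Neg"), visits[i], replace = TRUE))
}))

# One simulation: keep one random row per individual, then test the proportions
one_run <- function(d) {
  idx <- vapply(split(seq_len(nrow(d)), d$AnID, drop = TRUE),
                function(rows) rows[sample.int(length(rows), 1)],
                integer(1))
  s     <- d[idx, ]
  yr    <- as.character(s$year)
  n_pos <- as.vector(tapply(s$value == "Pos", yr, sum))     # positives per year
  n     <- as.vector(tapply(s$value == "Pos", yr, length))  # individuals per year
  prop.test(n_pos, n)$p.value
}

# Repeat the random draw many times and collect the p-values
n_sims   <- 200
p_values <- replicate(n_sims, one_run(dt))

# Share of simulations with p < 0.05
mean(p_values < 0.05, na.rm = TRUE)
```

Note that the seed is set once outside the function, so each call draws a different sample. The resulting share describes how sensitive the conclusion is to which row gets picked for each individual; it is not itself a p-value.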