0
votes

I am trying to compare two percentages/proportions for statistical significance in R using a chi-square test. I am familiar with a SAS method for chi-square in which I supply a dataset column for the numerator, another column for the denominator, and a categorical variable to distinguish the distributions (A/B).

However, I am getting unexpected values in R with some example sets. When I test two similar populations with low sample sizes, I get p-values of approximately zero, where I would expect the p-values to be very high (~1).

My test set is below, using sugar content in a batch of water as the example: e.g. "does group A use the same ratio of sugar as group B?". My actual problem is similar in that it isn't a pass/fail type of test and the numerator and denominator values can vary wildly between samples (different sugar and/or water weights per sample). My first objective is to verify that I can get a high p-value from two similar sets. The next question is: at what sample size does the p-value become low enough to indicate significance?

    # CREATE 2 NEARLY-EQUAL DISTRIBUTIONS (EXPECTING HIGH P-VALUE FROM PROP.TEST)
    library(tidyverse)  # for tibble() and the dplyr verbs used below
    set.seed(108)
    group_A <- tibble(group = "A", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
    group_B <- tibble(group = "B", sugar_lbs = rnorm(mean = 10, sd = 3, n = 50), batch_lbs = rnorm(mean = 30, sd = 6, n = 50))
    batches <- rbind(group_A, group_B) 

I then use summarize() to calculate the overall sugar percentage for each group:

    # SUMMARY TOTALS
    totals <- batches %>%
        group_by(group) %>%
        summarize(batch_count = n(),
            batch_lbs_sum = sum(batch_lbs), 
            sugar_lbs_sum = sum(sugar_lbs),
            sugar_percent_overall = sugar_lbs_sum / batch_lbs_sum) %>%
        glimpse()

I then supply each group's sugar percentage to prop.test, expecting a high p-value:

    # ADD P-VALUE & CONFIDENCE INTERVAL
    stats <- totals %>%
        rowwise() %>%
        summarize(p_val = prop.test(x = sugar_percent_overall, n =  batch_count, conf.level = 0.95, alternative = "two.sided")$p.value) %>%
        mutate(p_val = round(p_val, digits = 3)) %>%
        mutate(conf_level = 1 - p_val) %>%
        select(p_val, conf_level) %>%
        glimpse()
    
    # FINAL SUMMARY TABLE
    cbind(totals, stats) %>%
        glimpse()

Unfortunately, the final table gives me a p-value of 0, suggesting the two nearly identical sets are independent/different. Shouldn't I get a p-value of ~1?

    Observations: 2
    Variables: 7
    $ group                 <chr> "A", "B"
    $ batch_count           <int> 50, 50
    $ batch_lbs_sum         <dbl> 1475.579, 1475.547
    $ sugar_lbs_sum         <dbl> 495.4983, 484.6928
    $ sugar_percent_overall <dbl> 0.3357992, 0.3284833
    $ p_val                 <dbl> 0, 0
    $ conf_level            <dbl> 1, 1

From another angle, I also tried to compare the recommended sample size from power.prop.test with an actual prop.test using this recommended sample size. This gave me the reverse problem -- I was expecting a low p-value, since I am using the recommended sample size, but instead I get a p-value of ~1.

    # COMPARE POWER.PROP.TEST NEEDED COUNTS WITH AN ACTUAL PROP.TEST
    power.prop.test(p1 = 0.33, p2 = 0.34, sig.level = 0.10, power = 0.80, alternative = "two.sided") ## n = 38154
    prop.test(x = c(0.33, 0.34), n = c(38154, 38154), conf.level = 0.90, alternative = "two.sided") ## p = 1 -- shouldn't p be < 0.10?

Am I using prop.test incorrectly, or am I misinterpreting something? Ideally, I would prefer to skip the summarize step and simply supply the dataframe, the numerator column 'sugar_lbs', and the denominator column 'batch_lbs' as I do in SAS -- is this possible in R?

(Apologies for any formatting issues as I'm new to posting)

---------------------------------

EDIT - EXAMPLE WITH ONLY PROPORTIONS & SAMPLE SIZE

I think my choice of normal distributions may have distracted from the original question. I found an example that gets to the heart of what I was trying to ask: how to use prop.test given only a proportion/percentage and the sample size. Instead of city_percent and city_total below, I could simply rename these to sugar_percent and batch_lbs. I think this reference answers my question, and prop.test appears to be the correct test to use.

My actual problem has an extremely non-normal distribution, but is not easily replicated via code.

STANFORD EXAMPLE (pages 37-50)

- https://web.stanford.edu/class/psych10/schedule/P10_W7L1

    df <- tibble(city = c("Atlanta", "Chicago", "NY", "SF"), washed = c(1175, 1329, 1169, 1521), not_washed = c(413, 180, 334, 215)) %>%
        mutate(city_total = washed + not_washed,
            city_percent = washed / city_total) %>%
        select(-washed, -not_washed) %>%
        glimpse()
    
    # STANFORD CALCULATION (p = 7.712265e-35)
    pchisq(161.74, df = 3, lower.tail = FALSE) 
    
    # PROP TEST VERSION (SAME RESULT, p = 7.712265e-35)
    prop.test(x = df$city_percent * df$city_total, n = df$city_total, alternative = "two.sided", conf.level = 0.95)$p.value


2 Answers

1
votes

The documentation for prop.test says:

Usage

    prop.test(x, n, p = NULL,
              alternative = c("two.sided", "less", "greater"),
              conf.level = 0.95, correct = TRUE)

Arguments

x a vector of counts of successes, a one-dimensional table with two entries, or a two-dimensional table (or matrix) with 2 columns, giving the counts of successes and failures, respectively.

n a vector of counts of trials; ignored if x is a matrix or a table.

So if you want a "correct" test, you would have to use sugar_lbs_sum as the x instead of sugar_percent_overall. You should still receive some kind of warning that the x is non-integral, but that's not my major concern.
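For example, a rough sketch reusing the totals tibble and the 38154-trial setup from the question:

    # Rough sketch: pass the summed counts (successes and trials), not the ratio.
    prop.test(x = totals$sugar_lbs_sum, n = totals$batch_lbs_sum,
              conf.level = 0.95, alternative = "two.sided")

    # Same idea for the power.prop.test comparison: convert the proportions
    # 0.33 and 0.34 into success counts out of 38154 trials each.
    prop.test(x = round(c(0.33, 0.34) * 38154), n = c(38154, 38154),
              conf.level = 0.90, alternative = "two.sided")
    # This gives a small p-value, consistent with the power calculation at sig.level = 0.10.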

But from a statistical perspective this is completely the wrong way of doing things. By arbitrarily dividing by the sums, you are directly introducing spurious correlation into a test of the difference between two quantities. If the samples (sugar_lbs_sum) are independent but you divide them by their sums, you have made the ratios dependent. This violates the assumptions of the statistical test in a critical way. Kronmal (1993), "Spurious correlation and the fallacy of the ratio standard revisited", covers this.

The data you generated are independent normal, so don't sum them; instead, test for a difference with a t-test.
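For example, a minimal sketch using the batches data frame you generated:

    # Minimal sketch: compare mean sugar_lbs (and batch_lbs) between groups A and B
    # directly on the simulated batches data frame with two-sample t-tests.
    t.test(sugar_lbs ~ group, data = batches)
    t.test(batch_lbs ~ group, data = batches)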

0
votes

The Stanford link I added to my original post answered my question. I modified the Stanford example, simply renaming the variables from city to group and the washed counts to sugar_lbs. I also doubled one batch (analogous to comparing a small city with a large one). I now get the expected high p-value (0.65), indicating no statistically significant difference between the proportions.

When I add more groups (for more degrees of freedom) and continue to vary the batch sizes proportionally, I continue to get high p-values as expected, confirming the recipe is the same. If I modify the sugar percentage of any one group, the p-value immediately drops to zero, indicating that one of the groups is different, as expected.

Finally, when running prop.test within a dplyr pipe, I found I should not have used the rowwise() step, which caused my p-values to fall to zero. Removing this step gives the correct p-value. The only downside is that I don't know which group is different without comparing 2 groups at a time iteratively (a sketch of a pairwise approach follows the code below).


#---------------------------------------------------------
# STANFORD EXAMPLE - MODIFIED TO SUGAR & ONE BATCH DOUBLED
#---------------------------------------------------------
library(tidyverse)  # for tibble() and the dplyr verbs used below

df <- tibble(group = c("A", "B"), sugar_lbs = c(495.5, 484.7), water_lbs = c(1475.6 - 495.5, 1475.6 - 484.7)) %>%
    mutate(sugar_lbs = ifelse(group == "B", sugar_lbs * 2, sugar_lbs),
        water_lbs = ifelse(group == "B", water_lbs * 2, water_lbs)) %>%
    mutate(batch_lbs = sugar_lbs + water_lbs,
        sugar_percent = sugar_lbs / batch_lbs) %>%
    glimpse()

sugar_ratio_all <- sum(df$sugar_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
water_ratio_all <- sum(df$water_lbs) / (sum(df$sugar_lbs) + sum(df$water_lbs))
dof <- (2 - 1) * (length(df$group) - 1)

df <- df %>%
    mutate(sugar_expected = (sugar_lbs + water_lbs) * sugar_ratio_all,
        water_expected = (sugar_lbs + water_lbs) * water_ratio_all) %>%
    mutate(sugar_chi_sq = (sugar_lbs - sugar_expected)^2 / sugar_expected,
        water_chi_sq = (water_lbs - water_expected)^2 / water_expected) %>%
    glimpse()

q <- sum(df$sugar_chi_sq) + sum(df$water_chi_sq)

# STANFORD CALCULATION
pchisq(q, df = dof, lower.tail = FALSE)

# PROP TEST VERSION (SAME RESULT)
prop.test(x = df$sugar_percent * df$batch_lbs, n = df$batch_lbs, alternative = "two.sided", conf.level = 0.95)$p.value
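As a possible follow-up on finding which group differs, base R's pairwise.prop.test appears to run all the two-group comparisons with a multiple-testing adjustment. A rough sketch using the df built above (more useful once there are more than two groups):

# POSSIBLE FOLLOW-UP (SKETCH): PAIRWISE TWO-GROUP COMPARISONS
# Uses sugar_lbs as the "success" counts and batch_lbs as the trial totals
# from the df above; "holm" is the default multiple-testing adjustment.
pairwise.prop.test(x = df$sugar_lbs, n = df$batch_lbs, p.adjust.method = "holm")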