
I am working on a large dataset (7 million rows), trying to understand the correlations between individual independent variables and the dependent variables. When I run pcor(dataset), the resulting correlations are higher than those from cor(dataset).

My dataset has 6 dependent variables and 84 independent variables. For each dependent variable, I am computing its partial correlation with each of the 84 independent variables individually.

My independent variables are word counts per text type (75 categories) and some other social variables (all numerical), e.g. gender.

My question is: why am I getting high correlations when using pcor() in R but very weak correlations when using cor()? Is this normal behavior for partial correlation?

pcor is not a base R function. Which package does pcor belong to and what does it do? - Maurits Evers
Google tells me that pcor is part of packages ppcor, RVAideMemoire and ggm; which one did you use? Furthermore, in partial correlation you are measuring the correlation between two variables while controlling for additional confounding variables. Obviously this will give you different results than if you use cor. - Maurits Evers
There's no general ordering between the correlation and the partial correlation. Take e.g. x=rnorm(10000), y=rnorm(10000) and z=(x+y)/2. Then cor(x,y) is almost zero, while the partial correlation of x and y given z is extreme (close to -1). The other ordering is possible too. - Stéphane Laurent
I am using ppcor package in R - AKD
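A minimal sketch of the construction from Stéphane Laurent's comment, using only base R (no ppcor needed), with the partial correlation computed via regression residuals:

```r
set.seed(42)
x <- rnorm(10000)
y <- rnorm(10000)
z <- (x + y) / 2   # z depends on both x and y

cor(x, y)  # close to 0: x and y are independent

# Partial correlation of x and y given z, via the residual definition:
rx <- residuals(lm(x ~ z))
ry <- residuals(lm(y ~ z))
cor(rx, ry)  # -1: once z is fixed, y is determined as 2*z - x
```

Controlling for z creates a perfect (negative) partial correlation even though the marginal correlation is essentially zero; the reverse ordering is equally possible.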

1 Answer


If you're wondering whether a partial correlation coefficient can be larger than a "full" correlation coefficient, consider the following example.

Let's take a look at the sample data from the ppcor reference manual:

df <- data.frame(
    hl   = c(7, 15, 19, 15, 21, 22, 57, 15, 20, 18),
    disp = c(0.000, 0.964, 0.000, 0.000, 0.921, 0.000, 0.000, 1.006, 0.000, 1.011),
    deg  = c(9, 2, 3, 4, 1, 3, 1, 3, 6, 1),
    BC   = c(1.78e-02, 1.05e-06, 1.37e-05, 7.18e-03, 0.00e+00, 0.00e+00, 0.00e+00, 4.48e-03, 2.10e-06, 0.00e+00))

According to the original paper, the data describe the relationship between sequence and functional evolution in yeast proteins, and are available from [Drummond et al., Molecular Biology and Evolution 23, 327–337 (2006)].

We are interested in exploring the correlation between hl and disp.

Linear relationship between hl and disp

Let's start by plotting disp against hl:

library(ggplot2)
ggplot(df, aes(hl, disp)) +
    geom_point()

[scatterplot of disp against hl]

The standard ("full") Pearson product-moment correlation coefficient is given by

with(df, cor(hl, disp))
#[1] -0.2378724

As is obvious from the plot and the cor result, without controlling for any other variables, the linear relationship between hl and disp is not very strong.

Partial correlation

To recap the definition: the partial correlation between X and Y given confounding variables Z is the correlation of the residuals resulting from a linear regression of X on Z and of Y on Z.

Let's visualise the partial correlation by plotting the residuals of the two corresponding linear models hl ~ deg + BC and disp ~ deg + BC against each other.

ggplot(data.frame(
    res.x = lm(hl ~ deg + BC, df)$residuals, 
    res.y = lm(disp ~ deg + BC, df)$residuals)) +
    geom_point(aes(res.x, res.y))

[scatterplot of the residuals res.y against res.x]

The linear dependence between the two sets of residuals is obvious, suggesting a significant partial correlation between hl and disp. Let's confirm this by calculating the partial correlation between hl and disp whilst controlling for confounding effects from deg and BC:

pcor.test(df$hl, df$disp, df[, c("deg","BC")])
#    estimate    p.value statistic  n gp  Method
#1 -0.6720863 0.06789202 -2.223267 10  2 pearson
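As a cross-check, the residual-based definition recapped above reproduces the pcor.test estimate directly: correlating the two sets of residuals gives the same number.

```r
# df as constructed above
df <- data.frame(
    hl   = c(7, 15, 19, 15, 21, 22, 57, 15, 20, 18),
    disp = c(0.000, 0.964, 0.000, 0.000, 0.921, 0.000, 0.000, 1.006, 0.000, 1.011),
    deg  = c(9, 2, 3, 4, 1, 3, 1, 3, 6, 1),
    BC   = c(1.78e-02, 1.05e-06, 1.37e-05, 7.18e-03, 0.00e+00, 0.00e+00, 0.00e+00, 4.48e-03, 2.10e-06, 0.00e+00))

# Partial correlation = Pearson correlation of the residuals
res.x <- lm(hl ~ deg + BC, df)$residuals
res.y <- lm(disp ~ deg + BC, df)$residuals
cor(res.x, res.y)
#[1] -0.6720863
```

This matches the estimate column of the pcor.test output above.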

Conclusion

Pearson's product-moment correlation coefficient between hl and disp is larger in magnitude when we control for the confounding variables deg and BC (-0.67) than when we do not (-0.24). So yes, a partial correlation can easily exceed the corresponding "full" correlation.