1
votes

I am using this PHP routine to calculate the Pearson correlation:

function correlation ($x,$y) {
    $length = count($x);
    $mean1 = array_sum($x)/$length;
    $mean2 = array_sum($y)/$length;
    $a = $b = 0;
    $a2 = $b2 = 0;
    $axb = 0;
    for ($i = 0; $i < $length; $i++) {
        $a = $x[$i]-$mean1;
        $b = $y[$i]-$mean2;
        $axb +=$a*$b;
        $a2 += pow($a,2);
        $b2 += pow($b,2);
    }
    if ($sqrt = sqrt($a2*$b2))
        return $axb/$sqrt;
    return 0;
}

When I test it under several conditions, it returns 0 on exact matches:

echo correlation([0,0,0,0,0],[0,0,0,0,0]); // Returns 0!!
echo correlation([0,0,0,0,0],[1,1,1,1,1]); // Returns 0!!
echo correlation([1,1,1,1,1],[1,1,1,1,1]); // Returns 0!!
echo correlation([0,0,0,0,0],[9,9,9,9,9]); // Returns 0!!
echo correlation([0,0,0,0,0],[0,1,2,3,4]); // Returns 0 OK
echo correlation([9,9,9,9,9],[0,1,2,3,4]); // Returns 0 OK
echo correlation([0,1,2,3,4],[0,1,2,3,4]); // Returns 1 OK

Why? And how to accomplish that? Thank you!


For info:

A Pearson correlation is a number between -1 and 1 that indicates the extent to which two variables are linearly related. The Pearson correlation is also known as the “product moment correlation coefficient” (PMCC) or simply “correlation”.

1
" How to accomplish that?"...how to accomplish what? You didn't mention what the expected result was - ADyson
I already forgot all the stats I learnt at university but you get zero because that's your default return value to avoid division by zero. When you basically have a single dot in your scatter plot it doesn't make sense to calculate correlation. - Álvaro González
@ADyson 00000 and 00000 are equal and have full correlation, so why does it return 0 instead of 1, and how can I make it return 1? - Digerkam
1 means your values form a perfect line and Y grows when X grows. Two equal values can't form a line (or they can form infinite lines in any direction, however you want to consider). Wikipedia has a nice chart. To get 1 just replace return 0; with return 1;—but I doubt it's mathematically correct. I think a better result would be null. - Álvaro González
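The suggestion in the last comment (return null rather than 0 or 1 when the correlation is undefined) could be sketched roughly as below. This is only an illustration under stated assumptions: the name correlationOrNull is hypothetical, and treating empty or mismatched input as undefined is a choice made for the sketch, not something from the original routine.

```php
<?php
// Hypothetical variant of the question's correlation() routine.
// It returns null when the coefficient is undefined, i.e. when either
// input is constant and the denominator is therefore zero.
function correlationOrNull(array $x, array $y): ?float
{
    $length = count($x);
    if ($length === 0 || $length !== count($y)) {
        return null; // assumption: empty or mismatched input is "undefined"
    }

    $meanX = array_sum($x) / $length;
    $meanY = array_sum($y) / $length;

    $sumXY = $sumXX = $sumYY = 0.0;
    for ($i = 0; $i < $length; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $sumXY += $dx * $dy;
        $sumXX += $dx * $dx;
        $sumYY += $dy * $dy;
    }

    $denominator = sqrt($sumXX * $sumYY);
    if ($denominator == 0) {
        return null; // zero variance in at least one input
    }
    return $sumXY / $denominator;
}

var_dump(correlationOrNull([0, 0, 0, 0, 0], [0, 0, 0, 0, 0])); // NULL
var_dump(correlationOrNull([0, 1, 2, 3, 4], [0, 1, 2, 3, 4])); // float(1)
```

The caller can then decide how to treat a null result (skip the pair, report "n/a", and so on) instead of confusing it with a genuine correlation of 0.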

1 Answer

2
votes

Approach 1 (doing it on your own):

Using PHP for statistics is a hard path.

First of all, because PHP is a weakly typed language (you don't need to declare the types of variables), your values may be interpreted as int, so cast all of your variables to float and run the routine again. You can also run into problems with floats in PHP; see here for what I mean: https://3v4l.org/1FU9J

If you don't need high precision, you can round your results with round(), or call ini_set('precision', 3); to limit the number of significant digits PHP displays for floats.

Another thing: if you do need precision, use the BC Math extension, because floating-point arithmetic in PHP can introduce rounding errors that affect your results.

Read more about the BC Math extension here: https://www.php.net/manual/en/book.bc.php or try another language.
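As a small illustration of the floating-point issue and of what BC Math changes, here is a minimal sketch. It assumes the bcmath extension is installed and enabled, which is why the call is guarded with function_exists().

```php
<?php
// Binary floating point cannot represent 0.1, 0.2 or 0.3 exactly,
// so the sum is not equal to the literal 0.3.
var_dump(0.1 + 0.2 == 0.3); // bool(false)

// Assumption: the bcmath extension is available.
if (function_exists('bcadd')) {
    // BC Math operates on decimal strings; the third argument is the
    // scale (number of digits kept after the decimal point).
    var_dump(bcadd('0.1', '0.2', 1)); // string(3) "0.3"
}
```

Note that BC Math takes and returns strings, so a correlation routine built on it would need bcsub(), bcmul(), bcdiv() and bcsqrt() throughout rather than the native operators.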

Some references about the floating point:


Approach 2 (using language functions):

PHP also has a function for this, stats_stat_correlation(), provided by the PECL stats extension. So, if this isn't homework or a learning exercise, you can try it: https://www.php.net/manual/en/function.stats-stat-correlation.php