I want to know whether it is possible to estimate the correlation of a stream of x and y values when the stream is spread across multiple nodes and the results are aggregated on a master node. The single-node solution has been answered previously here.

How can we aggregate the means, the variances and, more importantly, the covariance without storing all the values? Is it possible?

1 Answer

Yes, you can. Suppose for example you have accumulated

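int n1;      // number of points
double mx1;  // mean of the x1's
double my1;  // mean of the y1's
double v1x;  // variance of the x1's (squared deviations summed, divided by n1)
double v1y;  // variance of the y1's (squared deviations summed, divided by n1)
double c1xy; // covariance of the x1's and y1's (also divided by n1)

and analogous variables n2, mx2, my2, v2x, v2y and c2xy for the x2's and y2's.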
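As a rough illustration of how one node might accumulate these quantities from its own stream without storing the points, here is a minimal Python sketch. The class name `RunningStats`, its fields and its methods are hypothetical and not taken from the linked single-node answer; it uses the usual incremental mean update and assumes population (divide-by-n) statistics.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Hypothetical per-node accumulator for one stream of (x, y) pairs."""
    n: int = 0        # number of points seen so far
    mx: float = 0.0   # running mean of x
    my: float = 0.0   # running mean of y
    sxx: float = 0.0  # running sum of squared deviations of x
    syy: float = 0.0  # running sum of squared deviations of y
    sxy: float = 0.0  # running sum of products of deviations

    def add(self, x: float, y: float) -> None:
        self.n += 1
        dx = x - self.mx  # deviation from the old mean of x
        dy = y - self.my  # deviation from the old mean of y
        self.mx += dx / self.n
        self.my += dy / self.n
        # Each update pairs an old-mean deviation with a new-mean deviation.
        self.sxx += dx * (x - self.mx)
        self.syy += dy * (y - self.my)
        self.sxy += dx * (y - self.my)

    def summary(self):
        """Return (n, mx, my, vx, vy, cxy); requires at least one point."""
        return (self.n, self.mx, self.my,
                self.sxx / self.n, self.syy / self.n, self.sxy / self.n)
```

Each node would call `add` once per incoming pair and send only the six numbers from `summary()` to the master node.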

These variables can be combined to get the statistics for the combined data set by

n = n1 + n2
mx = (n1*mx1 + n2*mx2)/n
my = (n1*my1 + n2*my2)/n
vx = (n1*v1x + n1*(mx1-mx)*(mx1-mx)
     +n2*v2x + n2*(mx2-mx)*(mx2-mx)
     )/n
vy = (n1*v1y + n1*(my1-my)*(my1-my)
     +n2*v2y + n2*(my2-my)*(my2-my)
     )/n
cxy = (n1*c1xy + n1*(mx1-mx)*(my1-my)
      +n2*c2xy + n2*(mx2-mx)*(my2-my)
      )/n
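A minimal Python sketch of the combining step may make this concrete. It assumes population (divide-by-n) statistics, so every combined variance and covariance is divided by n. The names `Stats`, `combine`, `aggregate` and `correlation` are hypothetical, chosen only for this illustration.

```python
import math
from functools import reduce
from typing import NamedTuple


class Stats(NamedTuple):
    n: int      # number of points
    mx: float   # mean of x
    my: float   # mean of y
    vx: float   # population variance of x
    vy: float   # population variance of y
    cxy: float  # population covariance of x and y


def combine(a: Stats, b: Stats) -> Stats:
    """Merge the summaries of two disjoint data sets."""
    n = a.n + b.n
    mx = (a.n * a.mx + b.n * b.mx) / n
    my = (a.n * a.my + b.n * b.my) / n
    # Each part contributes its own spread plus the shift of its mean.
    vx = (a.n * (a.vx + (a.mx - mx) ** 2)
          + b.n * (b.vx + (b.mx - mx) ** 2)) / n
    vy = (a.n * (a.vy + (a.my - my) ** 2)
          + b.n * (b.vy + (b.my - my) ** 2)) / n
    cxy = (a.n * (a.cxy + (a.mx - mx) * (a.my - my))
           + b.n * (b.cxy + (b.mx - mx) * (b.my - my))) / n
    return Stats(n, mx, my, vx, vy, cxy)


def aggregate(per_node):
    """Fold the summaries of any number of nodes into one."""
    return reduce(combine, per_node)


def correlation(s: Stats) -> float:
    """Pearson correlation; undefined when either variance is zero."""
    return s.cxy / math.sqrt(s.vx * s.vy)
```

On the master node, something like `correlation(aggregate(summaries))` would then give the overall correlation, where `summaries` is the list of per-node `Stats` values. Because the combination depends only on the six numbers per node, the nodes can be merged pairwise in any order.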

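To see why, take cxy as an example. By definition,

cxy = ( Sum{ i | (x1[i]-mx)*(y1[i]-my)}
      + Sum{ i | (x2[i]-mx)*(y2[i]-my)}
      )/n

Adding and subtracting each set's own means inside the factors gives

cxy = ( Sum{ i | (x1[i]-mx1+mx1-mx)*(y1[i]-my1+my1-my)}
      + Sum{ i | (x2[i]-mx2+mx2-mx)*(y2[i]-my2+my2-my)}
      )/n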

Expanding the first sum, we get

Sum{ i | (x1[i]-mx1)*(y1[i]-my1)}
+ Sum{ i | (x1[i]-mx1)}*(my1-my)
+ Sum{ i | (y1[i]-my1)}*(mx1-mx)
+ Sum{ i | 1} * (mx1-mx) * (my1-my)

The middle two sums are 0, because deviations from a set's own mean sum to zero, and Sum{ i | 1} is just n1, so the first sum is

n1*c1xy + n1*(mx1-mx)*(my1-my)

The second sum is analogous; adding the two and dividing by n gives the formula for cxy. Setting y = x (or x = y) in the same argument gives the corresponding result for the variances.
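The key identity in this derivation can be checked numerically. The short Python sketch below uses made-up sample data (the values are hypothetical, picked only for illustration) and compares the first sum, taken around the overall means, with n1*c1xy + n1*(mx1-mx)*(my1-my).

```python
# Hypothetical sample data held by two nodes.
x1, y1 = [1.0, 2.0, 4.0], [2.0, 3.0, 9.0]
x2, y2 = [5.0, 7.0], [8.0, 15.0]


def mean(values):
    return sum(values) / len(values)


n1 = len(x1)
mx1, my1 = mean(x1), mean(y1)
# Population covariance of the first set around its own means.
c1xy = sum((a - mx1) * (b - my1) for a, b in zip(x1, y1)) / n1

# Means of the combined data set.
mx, my = mean(x1 + x2), mean(y1 + y2)

# Left side: first sum of the definition, taken around the overall means.
lhs = sum((a - mx) * (b - my) for a, b in zip(x1, y1))
# Right side: the closed form obtained after the middle sums vanish.
rhs = n1 * c1xy + n1 * (mx1 - mx) * (my1 - my)

# Deviations from the set's own mean sum to zero (up to rounding).
middle = sum(a - mx1 for a in x1)

print(abs(lhs - rhs) < 1e-9, abs(middle) < 1e-9)
```

Both comparisons should hold up to floating-point rounding for any choice of data, which is what lets the master node work from the per-node summaries alone.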