1
votes

Is there a way to estimate the correlation of two variables if the data is received in chunks without storing the received pairs?

For example, we receive the pairs:

  1. [(x1, y1), (x2, y2), (x3, y3)]

  2. [(x4, y4)]

  3. [(x5, y5), (x6, y6)]

and we have to estimate the correlation between x1:6 and y1:6.

Non-optimal solution:

Even though this definition works: Correlation

it is suboptimal since if we have large values on the stream the squared values will easily overflow.

1

1 Answers

2
votes

Yes, this can be computed incrementally. The method is a small generalisation of Welford's algorithm, see here, for example

You maintain a number of variables, updating them each time data comes in. At each stage these are the mean etc of the data seen so far

Initialisation:

int n = 0; // number of points
double mx = 0.0; // mean of x's
double my = 0.0; // mean of y's
double vx = 0.0; // variance of x's
double vy = 0.0; // variance of y's
double cxy = 0.0; // covariance of x and y

Update (new values x,y in )

  n += 1;
double f = 1.0/n;
double dx = x - mx;
double dy = y - my;
  mx += f*dx;
  my += f*dy;
  vx = (1.0-f)*(vx + f*dx*dx);
  vy = (1.0-f)*(vy + f*dy*dy);
  cxy= (1.0-f)*(cxy+ f*dx*dy);

In terms of these variables we have

rxy = cxy/sqrt( vx*vy)

Note though that vx and vy will be zero after just one pair as been seen.

Don't be surprised if the stream of estimates for rxy is noisy. Estimates of correlation tend to be so.