2
votes

When using xcorr in MATLAB to cross correlate 2 related data sets, everything works as expected - I see a correlation peak and the lag reported is correct. However, when I use xcorr to cross correlate unrelated data sets where both data sets contain 1 cluster of "spikes", I see a correlation peak and the lag reported is the distance between the 2 spikes.

In this image:

[image: plots of x, y, and their cross-correlation]

x is a random data series. y is also a random data series. Both x and y have 30 random peaks inserted into the series in sequence. In theory, there should be no correlation between the 2 data sets since they are both very different. However, it can be seen from the 3rd plot that there is a very strong correlation between the 2 data sets. The code used to generate this figure is at the bottom of this post.

I've tried to filter the spikes using a few different mechanisms (rolling rms power ... etc) before performing the xcorr. This has worked in some cases but not all. I feel like I need a different approach to the problem, maybe an alternative to xcorr. I do understand why x and y cross correlate using xcorr. Is there another cross correlation tool that I can use? Note x and y will never be exactly the same, they will only ever be approximately the same but in normal operation, it's not the spikes that should make them correlate.

Any suggestions on how to tell if x and y correlate while also ignoring the "spikes"?

Here is my example code:

x = rand(1, 3000);
x = x - 0.5;
y = rand(1, 3000);
y = y - 0.5;
% insert the impulses into the data
impulse_width = 30;
impulse_max_height = 6;
x_impulse_start = 460;
y_impulse_start = 120;
rand_insert_x = rand(1, impulse_width);
rand_insert_x = (rand_insert_x - 0.5) * 2 * impulse_max_height;
rand_insert_y = rand(1, impulse_width);
rand_insert_y = (rand_insert_y - 0.5) * 2 * impulse_max_height;
x(1,x_impulse_start:x_impulse_start + impulse_width - 1) = rand_insert_x;
y(1,y_impulse_start:y_impulse_start + impulse_width - 1) = rand_insert_y;
subplot(3, 1, 1);
plot(x);
ylim([-impulse_max_height impulse_max_height]);
title('random data series: x');
subplot(3, 1, 2);
plot(y);
ylim([-impulse_max_height impulse_max_height]);
title('random data series: y');
[c, l] = xcorr(x, y);
subplot(3, 1, 3);
plot(l, c);
title('correlation using xcorr');
3
I see why this is a problem, but if I hadn't read your post I would think that xcorr is doing a good job, as it is aligning those quite similar signals together! - Ander Biguri
I think the first step should be to define when a "bunch of data" counts as a spike. - rst

3 Answers

1
votes

The way to solve this is to use normalized cross-correlation.

In normalized cross-correlation the correlation is 1 when the signals are exactly the same, and less when they are not. You can see it as a "percentage of similarity".

To do that in MATLAB, you just need to add 'coeff' as an argument to your code.

So, if I change your code to [c, l] = xcorr(x, y, 'coeff'); the plot I get is the following:

(note I changed sample size to 600 to make it more readable)

[image]

The cross-correlation only reaches about 0.3 there, so not much. However, if we change your code lines to

x(1,x_impulse_start:x_impulse_start + impulse_width - 1) = rand_insert_x;
y(1,y_impulse_start:y_impulse_start + impulse_width - 1) = rand_insert_x;

and insert the same random pattern in both signals, then we get:

[image]

Now, the cross-correlation gets to a high value, almost 1, but not one, because the big random pattern there is the same, but the rest of the signal is not.
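
To make the normalization concrete, here is a minimal sketch of what the 'coeff' option computes, assuming real-valued signals of equal length and the Signal Processing Toolbox. The seed, the 600-sample length, and the names c_manual, max_diff, and a_zero are illustrative choices, not part of the original code. Note that 'coeff' scales by the signal energies but does not subtract the mean, so it is not the same as a Pearson correlation coefficient.

```matlab
% Sketch only: xcorr requires the Signal Processing Toolbox.
rng(1);                          % fixed seed (assumption, for repeatability)
x = rand(1, 600) - 0.5;
y = rand(1, 600) - 0.5;

[c_raw, lags] = xcorr(x, y);     % unnormalized cross-correlation
c_coeff = xcorr(x, y, 'coeff');  % normalized by xcorr itself

% Manual normalization: divide by the square root of the product of the
% zero-lag autocorrelations, i.e. the signal energies.
c_manual = c_raw / sqrt(sum(x.^2) * sum(y.^2));

% The two versions should agree to within rounding error.
max_diff = max(abs(c_coeff - c_manual));

% A signal correlated with itself reaches 1 at lag 0.
a = xcorr(x, x, 'coeff');
a_zero = a(lags == 0);
```

Because the scaling uses the total energy of each signal, a short high-amplitude burst dominates the denominator as well as the numerator, which is why the peak stays well below 1 unless the bursts themselves match.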

0
votes

The cross-correlation is equivalent to convolving one signal with a time-reversed copy of the other. Imagine that during the cross-correlation, the two signals are at lags like I have shown here (x-axis labels should be completely ignored):

[image]

The positive (+) spike in series x (~ sample 490) is multiplied by the negative (-) spike in series y (~ sample 121), resulting in a large negative value in the xcorr, which we actually see in the bottom plot (~ sample 315). This large negative value is then added to something close to 0, since the rest of the signals are indeed low-power noise. I am afraid that no matter what xcorr function you use, you should get the same result. In fact, if there is another function that claims to be a cross-correlator but doesn't give the same result as xcorr(), then that function should not be called a cross-correlator. I hope this helps.
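
The link between xcorr and convolution can be checked numerically. The sketch below assumes real-valued row vectors of equal length; for that case, the unnormalized xcorr(x, y) should match conv applied to x and a time-reversed y. The seed, the vector length, and the variable names are arbitrary illustrative choices.

```matlab
% Sketch only: xcorr requires the Signal Processing Toolbox.
rng(2);                        % fixed seed (assumption, for repeatability)
x = rand(1, 50) - 0.5;
y = rand(1, 50) - 0.5;

c_xcorr = xcorr(x, y);         % lags run from -(N-1) to N-1
c_conv  = conv(x, fliplr(y));  % convolve x with time-reversed y

% Both vectors have 2*N-1 elements and should agree to rounding error.
max_diff = max(abs(c_xcorr - c_conv));
```

Each output sample is a sum of products of overlapping samples, so when two large spikes line up at some lag, their product dominates that sum regardless of what the low-amplitude background looks like.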

0
votes

My understanding of the question is "How do I remove these spikes from my data?"

The answer is to find something characteristic about those spikes, and then test each time window for that characteristic. If that test passes, then you have detected a spike, and you should remove that data.

For example, you might say "A spike is any time point that has an absolute value greater than some threshold." You determine the threshold using your data, say 0.2. Then you do something like

spikeless_data = data .* (abs(data)<0.2);

which copies data when abs(data)<0.2 and sets it to 0 when not.

You could also notice that a characteristic of spikes is that their derivative is very large, which might be more robust than a simple threshold. This would correspond to spikeless_data = data .* ([abs(diff(data)), 0] < some_threshold);

You will have to play around to find something that works for your data.
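
As a rough sketch of how the threshold idea could be combined with the question's synthetic data: the background noise there lies within [-0.5, 0.5], so a threshold of 0.6 is assumed here. That value, the fixed seed, and the names x_clean, y_clean, c_raw, and c_clean are illustrative assumptions, and real data will need its own criterion.

```matlab
% Sketch only: xcorr requires the Signal Processing Toolbox.
rng(0);                                   % fixed seed (assumption)
N = 3000;
x = rand(1, N) - 0.5;
y = rand(1, N) - 0.5;

% Insert 30-sample bursts as in the question (amplitudes within [-6, 6]).
x(460:489) = (rand(1, 30) - 0.5) * 12;
y(120:149) = (rand(1, 30) - 0.5) * 12;

% Hypothetical threshold, chosen just above the known background range.
threshold = 0.6;
x_clean = x .* (abs(x) < threshold);      % zero out samples at or above it
y_clean = y .* (abs(y) < threshold);

% Compare the normalized cross-correlation before and after masking.
[c_raw, lags] = xcorr(x, y, 'coeff');
c_clean = xcorr(x_clean, y_clean, 'coeff');

subplot(2, 1, 1); plot(lags, c_raw);   title('xcorr with spikes');
subplot(2, 1, 2); plot(lags, c_clean); title('xcorr after thresholding');
```

Burst samples that happen to fall below the threshold are left in place, but they are no larger than the background, so they should no longer dominate the sum at the lag where the two bursts overlap.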