2
votes

I'm working with a 30-by-26000 matrix in which each row has NaNs at the beginning and at the end, with more NaNs sprinkled throughout. I can fill in the interior NaNs with linear interpolation, but that still leaves NaNs at the beginning and end of each row. Extrapolating to replace the NaNs at the ends is not ideal for my data set.

I want to just trim the matrix. Take for example a 3 by 6 matrix:

NaN NaN 1 2  3  NaN
NaN  1  2 3 NaN NaN
 1  NaN 2 3  4   5

Cut off the leftmost and rightmost columns such that no row begins or ends with a NaN:

1 2
2 3
2 3

So we are left with a 3 by 2 matrix.

How can I do this in Matlab? (speed-optimized; I will need to apply this to a million size matrix)

Thanks!

3 Answers

2
votes

Firstly, argyris's vectorized solution will work perfectly well (+1). I'm only posting this because you emphasized that you wanted a speed-optimized solution. The downside of argyris's solution is that the sum and isnan operations are performed on the entire matrix. This will be optimal if you have to come a long way in on either side to find the first NaN-free column. But what if you don't? A loop-based solution that exploits the fact that you may only need to come in a few columns may do better (particularly given how good the JIT accelerator is getting at executing single loops quickly). I've put together a speed test that includes both argyris's solution and mine:

%#Set up an example case using the matrix size you indicated in the question
T = 30;
N = 26000;
X = rand(T, N);
TrueL = 8;
TrueR = N - 8;
X(:, 1:TrueL) = NaN;
X(:, TrueR+1:end) = NaN; %#last 8 columns (TrueR:end would blank 9 columns)

%#argyris solution
tic
I1 = sum(isnan(X));
argL = find(I1 == 0, 1, 'first');
argR = find(I1 == 0, 1, 'last');
Soln1 = X(:, argL:argR);
toc

%#My loop based solution (faster if TrueL and TrueR are small)
tic
for n = 1:N
    if ~any(isnan(X(:, n)))
        break
    end
end
ColinL = n;
for n = N:-1:1
    if ~any(isnan(X(:, n)))
        break
    end
end
ColinR = n;
Soln2 = X(:, ColinL:ColinR);
toc

In the above example, the solution will need to get rid of the first 8 and last 8 columns. The outcome of the speed test?

Elapsed time is 0.002919 seconds. %#argyris solution
Elapsed time is 0.001007 seconds. %#My solution

The loop based solution is almost 3 times faster. Okay, now let's up the number of columns that we need to get rid of on either side to 100:

Elapsed time is 0.002769 seconds. %#argyris solution
Elapsed time is 0.001999 seconds. %#My solution

Still ahead. What about 1000 columns on either side?

Elapsed time is 0.003597 seconds. %#argyris solution
Elapsed time is 0.003719 seconds. %#My solution

So we've found our tipping point (on my machine at least - Quad core i7, Linux Mint v12, Matlab R2012b). Once we need to come in about 1000 columns on either side, we're better off using the vectorized solution.

One final note of CAUTION: If the routine runs inside another (possibly unrelated) loop, then the speed comparisons should be redone, because my solution will then involve a double loop. Even if the loops are unrelated, the JIT accelerator is not so good with double loops. I did some quick tests on my machine, and my solution still comes out ahead for small TrueL and TrueR (i.e. less than 100), but the advantage is not as large as it was when the outer loop was not present.
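One edge case the loops above do not cover: if no column is free of NaNs, both loops run to completion without hitting break, and the final indexing step returns the wrong columns. Below is a minimal sketch of the same loop-based idea wrapped in a function with a guard for that case. The function name trimNaNColumnsLoop and the choice to return an empty T-by-0 matrix are my own assumptions, not part of the timings above.

```matlab
function Y = trimNaNColumnsLoop(X)
% Hypothetical helper: trim leading/trailing columns that contain any NaN.
% Returns a size(X,1)-by-0 matrix if no column is NaN-free (an assumption).
N = size(X, 2);

% Scan from the left for the first NaN-free column.
left = 0;
for n = 1:N
    if ~any(isnan(X(:, n)))
        left = n;
        break
    end
end

% Guard: no clean column was found at all.
if left == 0
    Y = zeros(size(X, 1), 0);
    return
end

% Scan from the right; this stops at column "left" at the latest.
right = left;
for n = N:-1:left
    if ~any(isnan(X(:, n)))
        right = n;
        break
    end
end

Y = X(:, left:right);
end
```

With the test matrix from the speed test it would be called as Soln2 = trimNaNColumnsLoop(X). Since the call adds function overhead, the timings reported above should not be assumed to carry over unchanged.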

Anyway, hope this proves useful to you or anyone else who comes a-reading.

Cheers!

EDIT: I've done a few speed tests incorporating angainor's very neat one-liner (+1). It performs almost as well as my loop-based solution when the number of columns to be removed is small. Surprisingly, it didn't scale that well when the number of columns to be removed is large, unlike argyris's solution. That may have something to do with the computer I'm on now though: a work Windows machine - I've never really trusted it fully :-)

7
votes

For your example you can do the following:

Let a be your matrix containing NaN and numerical values.

ind1 = sum(isnan(a),1); % count the NaN values along columns

s = find(ind1 == 0, 1, 'first'); % find the first column without any NaN

e = find(ind1 == 0, 1, 'last'); % find the last column without any NaN

So now just keep this part of the matrix from s-th to e-th column:

b = a(:,s:e);

An additional check may be needed for the case where no column is clear of NaNs: find then returns empty for both s and e.
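A minimal sketch of such a check, using the 3-by-6 matrix from the question. Returning an empty matrix with the same number of rows is just one possible convention and is my assumption; raising an error may suit your data better. The variable name clean is also mine.

```matlab
% Example matrix from the question.
a = [NaN NaN 1 2 3   NaN;
     NaN 1   2 3 NaN NaN;
     1   NaN 2 3 4   5];

clean = ~any(isnan(a), 1);      % true for columns that contain no NaN
s = find(clean, 1, 'first');    % first clean column (empty if there is none)
e = find(clean, 1, 'last');     % last clean column (empty if there is none)

if isempty(s)
    % Assumption: no clean column means nothing can be kept.
    b = zeros(size(a, 1), 0);
else
    b = a(:, s:e);
end
```

For the example this keeps columns 3 and 4, matching the 3-by-2 result in the question.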

2
votes

Both earlier proposed solutions are great; I am posting this one-liner for completeness:

A(:,isfinite(sum(A)))

ans =

 1     2
 2     3
 2     3

It avoids going through the matrix entries twice (as Colin pointed out) by first calculating the column sums and after that calling isfinite. I also removed the find calls - they are not necessary since you can use logical indexing instead.
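One difference from the find-based answers is worth checking against your data: the one-liner drops every column whose sum is not finite, including interior columns, whereas the first/last approach keeps everything between the outermost clean columns. The small matrix below is a made-up example with a NaN in an interior column, intended only as a sketch of that difference.

```matlab
% Hypothetical 2-by-7 matrix: NaN borders plus one NaN in interior column 4.
A = [NaN 1 2 3   4 5 NaN;
     NaN 1 2 NaN 4 5 NaN];

% One-liner: keeps columns 2, 3, 5 and 6, so B1 is 2-by-4.
B1 = A(:, isfinite(sum(A, 1)));

% first/last approach: keeps columns 2 through 6, so B2 is 2-by-5
% and still contains the interior NaN at B2(2, 3).
clean = ~any(isnan(A), 1);
B2 = A(:, find(clean, 1, 'first'):find(clean, 1, 'last'));
```

If the interior NaNs are going to be interpolated afterwards, as described in the question, the second behaviour is probably the one you want. Note also that isfinite is false for Inf, so a column containing Inf would be dropped by the one-liner even though it has no NaN.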

I do not have my computer here, so I leave out the performance tests.