1
votes

I've decided to get a little wild this evening and party with histogram bins to operate on some financial data I'm analyzing.

It appears the party has been pooped on, though, as the manner through which I would like to apply my 'intra-bin' operation is not readily apparent, neither through research nor playing around, and proving bothersome.

The Desire: I would like to use the 'binning' index within a column to perform some kind of row-wise 'intra-bin' operation where said operation will make a relative reference to the first element of its own bin. Please consider the following single bin example where the operation is to take a difference

A=

1   10.4
1   10.6
1   10.3
1   10.2

The relative operation will take the difference between all elements of column 2 and the 1st element of column 2 such that

bin_differencing_function(A)=

1   10.4    0.0
1   10.6    0.2
1   10.3    -0.1
1   10.2    -0.2

Now, still more convenient would be to be able to feed bin_differencing_function(A) a dual column matrix with an arbitrary number of bins such that if

A=

1   10.4
1   10.6
1   10.3
1   10.2
2   10.2
2   10.6
2   10.8
2   10.8
3   11.0
3   10.8
3   10.8
3   10.8

better_bin_differencing_function(A)=

1   10.4    0.0
1   10.6    0.2
1   10.3    -0.1
1   10.2    -0.2
2   10.2    0.0
2   10.6    0.4
2   10.8    0.6
2   10.8    0.6
3   11.0    0.0
3   10.8    -0.2
3   10.8    -0.2
3   10.8    -0.2

Most convenient would be to be able to feed better_bin_differencing_function(A) a dual column matrix with an arbitrary number of bins where the bin length may not be constant such that if

A=

1   10.4
1   10.6
1   10.3
2   10.2
2   10.6
2   10.8
2   10.8
2   10.7
3   11.0
3   10.8

best_bin_differencing_function(A)=

1   10.4    0.0
1   10.6    0.2
1   10.3    -0.1
2   10.2    0.0
2   10.6    0.4
2   10.8    0.6
2   10.8    0.6
2   10.7    0.5
3   11.0    0.0
3   10.8    -0.2

The big desire is to create a piece of code that takes advantage of vectorization (if possible) to operate on many bins who's lengths will vary between 1 and 200. I'm thinking a play on accumarray may do the trick such that

accumarray(A(:,1),A(:,2),[],@(x) fun(x))

Where fun(x) is a function with a for loop.

I'm running MATLAB 7.10.0.499 (R2010a) on Windows 7. Sorry the examples made this query so long.

2
my gut feeling is that a for loop combined with bsxfun will be a fast solution here...bla

2 Answers

1
votes

Alright stackoverflow, I figured it out! Turns out I was right about using accumarray

Matrices B,C, and A are only defined within the function for validation convenience. Matrix A would be passed as follows: best_bin_differencing_function(A)

function differenced_bins=best_bin_differencing_function()
B=[1 1 1 2 2 2 2 2 3 3]';
C=[10.4 10.6 10.3 10.2 10.6 10.8 10.8 10.7 11.0 10.8]';
A=[B,C]; 
differenced_bins=cell2mat(accumarray(A(:,1),A(:,2),[],@(x) {fun(x)}));
end

function y=fun(var)
    y=zeros(length(var),1);
    for i=1:length(var)
        y(i)=var(i)-var(1);
    end
end

I'm going to run a stress test between this and @Divakar's response and will up-vote accordingly. Thank you all for taking a look!

1
votes

Approach #1

Here's a bsxfun based approach -

%// Get the first column IDs from A and positions of the elements from
%// each ID/bin
[A_id,first_idx] = unique(A(:,1))

%// First elements from each ID/bin
first_ele = A(first_idx,2)

%// Get a 2D logical array s.t. the ones in each column represent the
%// presence of all element corresponding to each ID/bin
match_ind = bsxfun(@eq,A(:,1),A_id') %//'

%// Create the base matrix with the logical array, s.t. the ones are
%// replaced by the actual elements
base_mat = bsxfun(@times,match_ind,first_ele.') %//'

%// Final accumulate all the elements and subtract from the second column
%//of A to form the new new column for the output
out = [A A(:,2) - base_mat(match_ind)]

Approach #2

bsxfun based approach might be resource-hungry, so here's a for-loop based one that assumes sorted bins/IDs in the input data -

[A_id,first_idx] = unique(A(:,1));
[A_id,last_idx] = unique(A(:,1),'last');
out = [A A(:,2)];
for k1 = 1:numel(first_idx)
    first_id = first_idx(k1);
    last_id = last_idx(k1);
    out(first_id:last_id,3) = out(first_id:last_id,3) - out(first_id,3);
end

Approach #3

This could be an interesting approach to test out too -

[~,first_id] = max(bsxfun(@eq,A(:,1),A(:,1)')) %//'
out = [A A(:,2) - A(first_id,2)]

Approach #4

Again, assuming sorted bins/IDs, this is a diff + cumsum based approach and appears to be the fastest given the assumptions are met -

first_match = [1; diff(A(:,1))]
first_match(first_match==1) = [1 ; diff(find(first_match))]
out = [A A(:,2) - A(cumsum(first_match),2)]

Please note that if they are not sorted, you can use sortrows as shown here -

[A,sorted_ind] = sortrows(A,1)
first_match = [1; diff(A(:,1))]
first_match(first_match==1) = [1 ; diff(find(first_match))]
out(sorted_ind,:) = [A A(:,2) - A(cumsum(first_match),2)]

You can use this technique for all other approaches that assumes sorted data to make them generalized.