I want to apply feature selection on a dataset (lung.mat) After loading the data, I computed the mean of distances between each feature with others by Jaccard measure. Then I sorted the distances descendingly in B1. And then I selected for example 25 number of all the features and saved the matrix in databs1. I want to select the features that have distance values greater than the mean of the array (B1).
close all;
clc
load lung.mat
data=lung;
[n,m]=size(data);
for i=1:m-1
for j=i+1:m
t1(i,j)=fjaccard(data(:,i),data(:,j));
b1=sum(t1)/(m-1);
end
end
[B1,indB1]=sort(b1,'descend');
databs1=data(:,indB1(1:25));
databs1=[databs1,data(:,m)]; %jaccard
save('databs1.mat');
I’ll be grateful to have your opinions about how to define this in B1, selecting values of B1 which are greater than the mean of the array B1, It means cutting the rest of smaller values than the mean of B1. I used this line,
B1(B1>mean(B1(:)))
after running, B1 still has the full number of features(column) equal to the full dataset, for example, lung.mat has 57 features and B1 by this line still has 57 columns, I considered that by this line B1 will be cut to the number of features that are greater than the mean of B1.
b1
is overwritten every loop iteration, so only the result of the last loop iteration is used. You should move that calculation out of the loops. You should also pre-allocatet1
before the loops:t1 = zeros(m-1,m)
. This will speed up your computation significantly. - Cris Luengo