0
votes

I have a large data set with two arrays, say x and y. The arrays have over 1 million data points in size. Is there a simple way to do a scatter plot of only 2000 of these points but have it be representative of the entire set?

I'm thinking along the lines of creating another array r ; r = max(x)*rand(2000,1) to get a random sample of the x array. Is there a way to then find where a value in r is equal to, or close to a value in x ? They wouldn't have to be in the same indexed location but just throughout the whole matrix. We could then plot the y values associated with those found x values against r

I'm just not sure how to code this. Is there a better way than doing this?

2

2 Answers

0
votes

I'm not sure how representative this procedure will be of your data, because it depends on what your data looks like, but you can certainly code up something like that. The easiest way to find the closest value is to take the min of the abs of the difference between your test vector and your desired value.

r = max(x)*rand(2000,1);
for i = 1:length(r)
    [~,z(i)] = min(abs(x-r(i)));
end
plot(x(z),y(z),'.')

Note that the [~,z(i)] in the min line means we want to store the index of the minimum value in vector z.

You might also try something like a moving average, see this video: http://blogs.mathworks.com/videos/2012/04/17/using-convolution-to-smooth-data-with-a-moving-average-in-matlab/

Or you can plot every n points, something like (I haven't tested this, so no guarantees):

n = 1000;
plot(x(1:n:end),y(1:n:end))

Or, if you know the number of points you want (again, untested):

npoints = 2000;
interval = round(length(x)/npoints);
plot(x(1:interval:end),y(1:interval:end))
0
votes

Perhaps the easiest way is to use round function and convert things to integers, then they can be compared. For example, if you want to find points that are within 0.1 of the values of r, multiply the values by 10 first, then round:

r = max(x) * round(2000,1);
rr = round(r / 0.1);
xx = round(x / 0.1);
inRR = ismember(xx, rr)
plot(x(inRR), y(inRR));

By dividing by 0.1, any values that have the same integer value are within 0.1 of each other.

ismember returns a 1 for each value of xx if that value is in rr, otherwise a 0. These can be used to select entries to plot.