3
votes

I have a small test dataset:

0.1 0.2 0.3 0.4 0.5 1.1 1.2 1.3 1.4 1.5 0.1 0.2 0.3 0.4 0.5

I'd like to get the frequency count between the minimum (0.1) and the maximum (1.5) with a bin width (step size) of 0.1. I have tried this in Matlab, Octave, Origin, and an AWK script, but I get a completely different result from each.

1. Matlab

data  = [0.1 0.2 0.3 0.4 0.5 1.1 1.2 1.3 1.4 1.5 0.1 0.2 0.3 0.4 0.5];
edge  = 0.1:0.1:1.5;
count = histc(data, edge);

result is:

count = [2 4 0 2 2 0 0 0 0 0 1 1 1 1 1]

2. Octave

data  = [0.1 0.2 0.3 0.4 0.5 1.1 1.2 1.3 1.4 1.5 0.1 0.2 0.3 0.4 0.5];
edge  = 0.1:0.1:1.5;
count = histc(data, edge);

result is:

count = [2 2 2 2 2 0 0 0 0 0 1 2 0 1 1]

3. Origin

Use the "frequency count" command with min=0.1, max=1.5, step size=0.1.

result is:

count = [2 4 0 2 2 0 0 0 0 0 2 1 1 1]

4. AWK

{...;count[data/0.1]++;} ...

result is:

count = [2 4 0 2 2 0 0 0 0 0 2 0 2 0 1]

Why do I get these different results? Am I doing something wrong, or have I misunderstood the concept of "frequency count"? I don't think any of the above results is correct. Could you please tell me what I should do?

My Octave (3.6.2) output: 2 4 0 2 2 0 0 0 0 0 2 1 1 0 1. - Paul R
Another different output... Why? - Dong
All your values are on bin edges, so it's probably just down to floating-point precision/rounding. - Paul R
So, how do I get the right result? Any suggestions? - Dong
It depends on what you think the "right" result should be, e.g. which bin should a value of, say, 1.5 be assigned to if the bin interval is 0.1? - Paul R

1 Answer

5
votes

A quick workaround is to shift the bin edges by half a step, so that no data value falls exactly on an edge.

Matlab:

data  = [0.1 0.2 0.3 0.4 0.5 1.1 1.2 1.3 1.4 1.5 0.1 0.2 0.3 0.4 0.5];
edge  = 0.05:0.1:1.55;
count = histc(data, edge)

results:

  Columns 1 through 9

     2     2     2     2     2     0     0     0     0

  Columns 10 through 16

     0     1     1     1     1     1     0

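Note: the trailing 16th entry is an artifact of histc, which returns one count per edge; the last entry counts only values exactly equal to edge(end). Here length(edge) is 16, one more than the 15 bins of interest.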
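The shifted-edge idea is not specific to Matlab. As a rough illustration, here is a minimal Python sketch of the same binning; the helper name `shifted_counts` is made up for this example, and it uses half-open bins, so unlike histc it has no extra last entry for values equal to the final edge.

```python
from bisect import bisect_right

data = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5,
        0.1, 0.2, 0.3, 0.4, 0.5]

# Edges shifted by half a step: 0.05, 0.15, ..., 1.55 (16 edges, 15 bins).
edges = [0.05 + 0.1 * k for k in range(16)]

def shifted_counts(values, edges):
    """Count values per half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in values:
        i = bisect_right(edges, x) - 1  # index of the last edge <= x
        if 0 <= i < len(counts):        # ignore values outside the edge range
            counts[i] += 1
    return counts

counts = shifted_counts(data, edges)
print(counts)  # [2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

Every data value now sits about 0.05 away from the nearest edge, so rounding errors of the order of 1e-16 in the edge values cannot move it into a neighbouring bin.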

Then, as Paul R suggested, it comes down to precision and rounding. You would have to look into each tool's frequency-count function to see how it treats values that lie on a bin edge. If I were you, I would multiply everything by 10 and convert the values to integers.

data  = int8(data .* 10)   % int8() rounds to the nearest integer
edge  = 1:15;
count = histc(data, edge)

results:

  Columns 1 through 9

     2     2     2     2     2     0     0     0     0

  Columns 10 through 15

     0     1     1     1     1     1
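For comparison, here is a small Python sketch of the same scale-and-convert approach. It assumes the data has one decimal place (hence the factor 10); the names `scale`, `keys` and `tally` are only illustrative. The important detail is rounding to the nearest integer, which is also what Matlab's int8() does, rather than truncating.

```python
from collections import Counter

data = [0.1, 0.2, 0.3, 0.4, 0.5, 1.1, 1.2, 1.3, 1.4, 1.5,
        0.1, 0.2, 0.3, 0.4, 0.5]

scale = 10  # 10 ** (number of decimal places in the data)

# round() snaps each scaled value to the nearest integer, so tiny
# representation errors in x * scale no longer matter.
keys = [round(x * scale) for x in data]

tally = Counter(keys)
counts = [tally[k] for k in range(1, 16)]  # integer bins 1..15
print(counts)  # [2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Truncation is not a safe substitute: 0.3 / 0.1 evaluates to
# 2.9999999999999996, which truncates to 2 instead of 3.
truncated = int(0.3 / 0.1)
print(truncated)  # 2
```

The truncation case at the end shows how a value can land one bin too low when a quotient is cut off instead of rounded.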

What matters is how the human interprets the result, not the machine. If you know you multiplied by 10^(your precision) and converted the values to integers, you don't need to care what the machine really does. If you have irrational numbers in your data and still see errors, check how floating-point numbers are encoded (http://en.wikipedia.org/wiki/Floating_point).
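To see the encoding issue directly, here is a short Python check (any language using IEEE 754 doubles should behave the same way for these values). It prints the value actually stored for the literal 0.1 and shows why exact equality tests on such numbers are fragile.

```python
import math
from decimal import Decimal

# The literal 0.1 is stored as the nearest binary double, not exactly 1/10.
exact = Decimal(0.1)
print(exact)  # slightly more than 0.1

# Arithmetic on such values can therefore miss an "obvious" equality.
naive_equal = (0.1 + 0.2 == 0.3)
print(naive_equal)  # False

# Comparing with a tolerance is the usual remedy.
close = math.isclose(0.1 + 0.2, 0.3)
print(close)  # True
```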

Cheers.