How to keep a dynamical histogram?

Question

Is there a known algorithm and data structure for maintaining a dynamic histogram?

Imagine I have a stream of data (x_1, w_1), (x_2, w_2), ... where each x_t is a double representing some measured variable and w_t is the associated weight.

I could just do the obvious (pseudo-python code):

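x0, xN = 0.0, 10.0
numbins = 100
delta = (xN - x0) / numbins    # fixed bin width
hist = [[x0 + i * delta, 0.0] for i in range(numbins)]    # [left edge, total weight]

def updateHistogram(x, w):
    k = lookup(x, hist)    # find the bin that x falls into
    hist[k][1] += w        # accumulate the weight of this observation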

But this causes problems when the data arrives as a continuous stream. I don't have the full dataset at hand, and I need to inspect the histogram while the data is still being gathered. I also have no prior knowledge of:

the ideal bin sizes for not ending up with a lot of empty bins,
the range of the data

So I'd like to define the bins dynamically. I could do the stupid thing:

data = []
for x in data_stream:
    data.append(x)
    hist = make_histogram(data)    # rebuild from scratch on every new point

but I guess this will get slow very quickly...

If all the weights were equal, one thing I thought of was storing the data in a sorted array and inserting each new point so that the array stays sorted. That way I could have:

import bisect

data = []
for x in data_stream:
    bisect.insort(data, x)    # insert x while keeping data sorted
    bins = [data[i * len(data) // numbins] for i in range(numbins)]

and the count inside each bin would be equal to len(data)/numbins for all bins.

I can't think of a way of including the weights in this, though. Does anyone have a suggestion? (Pointers to C++ libraries that do this would also be welcome.)

EDIT (for the requested clarification):

The x_t are floating point numbers. To calculate the histogram I must divide the continuous range the x's belong to into a number of bins. So I'll have a sequence of bin edges bin[0], bin[1], etc., and I must determine for which i bin[i] < x < bin[i+1] holds.

This is how you usually do a histogram when you have all the data beforehand. You'd then know the limits max(x) and min(x) and it would be easy to determine adequate bins. You could have them equally spaced between min(x) and max(x), for example.

If you don't know the range beforehand, you can't determine the bins. You could receive an x that doesn't fall in any bin. Or you could end up with many empty bins because you chose too big a range when creating the bins.

Can you please clarify, if you only care about the weights, why you don't simply do data[x] += w? What do you care about besides the weights? — ninjagecko
x is a floating point number... for a sequence of numbers bin[0], bin[1],... I must determine for which i does bin[i] < x < bin[i+1]. It's not a discrete system. — Rafael S. Calsaverini

csgillespie · Accepted Answer · 2011-07-30T14:05:02

How to determine the number of bins

There are a number of rules for determining the number of bins in a histogram. For your problem, I would go with Scott's choice:

bin_width = 3.5*sd*n^{-1/3}

where sd is the standard deviation and n is the number of data points. Crucially, you can use an online algorithm for calculating the standard deviation. The number of bins, k, is given by:

k = ceil((max(x) - min(x))/bin_width)
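As a rough illustration of the two formulas above, here is a minimal sketch that tracks the standard deviation with Welford's online algorithm and derives Scott's bin width and the bin count from it. The class name RunningStats and its method names are hypothetical, chosen only for this example, and weights are ignored here.

```python
import math


class RunningStats:
    """Welford's online mean/variance, plus running min and max (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations from the mean
        self.lo = math.inf
        self.hi = -math.inf

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def sd(self):
        # Sample standard deviation; needs at least two points.
        return math.sqrt(self.m2 / (self.n - 1))

    def scott_bin_width(self):
        # bin_width = 3.5 * sd * n^(-1/3)
        return 3.5 * self.sd() * self.n ** (-1.0 / 3.0)

    def num_bins(self):
        # k = ceil((max(x) - min(x)) / bin_width)
        return math.ceil((self.hi - self.lo) / self.scott_bin_width())
```

Each call to add costs constant time and memory, so the bin-width estimate can be refreshed after every observation without storing the stream.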

Storing the data

Suppose we have observed N data points. Then the confidence interval for the standard deviation is:

Lower limit: sd*sqrt((N-1)/CHIINV((alpha/2), N-1))
Upper limit: sd*sqrt((N-1)/CHIINV(1-(alpha/2), N-1))

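where CHIINV is the inverse of the upper-tail chi-squared distribution (as in the spreadsheet function of that name) and alpha is the significance level. When N = 1000 and alpha = 0.05, the CI for the sd is: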

(0.96*sd, 1.05*sd)

and so a 95% CI for the bin width is:

(3.5*0.96*sd*1000^{-1/3}, 3.5*1.05*sd*1000^{-1/3})
(0.336*sd, 0.3675*sd)

You can get something similar for the number of bins.
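The numbers above can be checked with a short sketch. To stay within the standard library it approximates the chi-squared quantile with the Wilson-Hilferty formula, which is a substitution made for this example rather than what CHIINV does internally; the function names are hypothetical. Note that chi2_quantile takes a lower-tail probability, so CHIINV(alpha/2, N-1) corresponds to chi2_quantile(1 - alpha/2, N-1).

```python
import math
from statistics import NormalDist


def chi2_quantile(p, df):
    """Approximate chi-squared quantile for lower-tail probability p (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3


def sd_ci_factors(n, alpha=0.05):
    """Multipliers (lower, upper) applied to the sample sd for a (1 - alpha) CI."""
    df = n - 1
    lower = math.sqrt(df / chi2_quantile(1.0 - alpha / 2.0, df))
    upper = math.sqrt(df / chi2_quantile(alpha / 2.0, df))
    return lower, upper


def bin_width_ci_factors(n, alpha=0.05):
    """Multipliers of sd that bound Scott's bin width 3.5 * sd * n^(-1/3)."""
    lower, upper = sd_ci_factors(n, alpha)
    scale = 3.5 * n ** (-1.0 / 3.0)
    return scale * lower, scale * upper
```

For N = 1000 the sd multipliers come out close to the (0.96, 1.05) quoted above. The approximation is good for large N; for small samples an exact chi-squared quantile would be the safer choice.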

Algorithm

Store all the data until you have a good estimate of the optimal bin-width, say when the lower and upper CI for the number of bins are equal.
Create that number of bins and put the stored data into them.
All new data points are put into the bins, then discarded.
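The three steps above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses a fixed warm-up count instead of the CI-based stopping test, it computes the standard deviation from the unweighted x values, it accumulates the weights w in each bin, and it tracks out-of-range points in underflow/overflow totals. The class name StreamingHistogram and all attribute names are hypothetical.

```python
import math
import statistics


class StreamingHistogram:
    """Buffer a warm-up sample, fix the bins with Scott's rule, then keep only bin totals."""

    def __init__(self, warmup=1000):
        self.warmup = warmup      # assumed threshold, stands in for the CI test
        self.buffer = []          # (x, w) pairs, kept only during warm-up
        self.lo = None            # left edge of the first bin
        self.width = None         # common bin width
        self.totals = None        # summed weight per bin
        self.underflow = 0.0      # weight that fell below the first bin
        self.overflow = 0.0       # weight that fell above the last bin

    def add(self, x, w=1.0):
        if self.totals is None:
            self.buffer.append((x, w))
            if len(self.buffer) >= self.warmup:
                self._freeze_bins()
        else:
            self._deposit(x, w)

    def _freeze_bins(self):
        xs = [x for x, _ in self.buffer]
        self.width = 3.5 * statistics.stdev(xs) * len(xs) ** (-1.0 / 3.0)
        if self.width == 0.0:     # constant warm-up sample: fall back to unit width
            self.width = 1.0
        self.lo = min(xs)
        # floor + 1 equals the ceil rule except when the range is an exact
        # multiple of the width, where it keeps max(xs) inside the last bin.
        k = math.floor((max(xs) - self.lo) / self.width) + 1
        self.totals = [0.0] * k
        for x, w in self.buffer:
            self._deposit(x, w)
        self.buffer = []          # raw points are discarded from here on

    def _deposit(self, x, w):
        i = math.floor((x - self.lo) / self.width)
        if i < 0:
            self.underflow += w
        elif i >= len(self.totals):
            self.overflow += w
        else:
            self.totals[i] += w
```

After the warm-up, memory use is constant: each new (x, w) pair updates one total and is then dropped.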

Comments

The Freedman–Diaconis rule is better for choosing the number of bins, but it involves the interquartile range, which is a bit trickier to calculate online.
Technically, the CI isn't correct when the data is sequential. But if you set a reasonable minimum number of data points to observe, say ~100 or 1000, you should be OK.
This assumes the data all follow the same distribution.
The bin width scales as n^{-1/3}, so the number of bins grows as n^{1/3}. If you know roughly how many points to expect, e.g. 10^5, 10^6 or 10^7, then you could create smaller bins with the expectation of changing the bin width in the future.
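One possible reading of the last comment is to store narrow bins sized for the expected final n and, while fewer points have arrived, view them at a coarser width by summing neighbouring bins. A minimal sketch of that coarsening step, with a hypothetical function name and assuming equal-width bins stored as a list of totals:

```python
def merge_adjacent_bins(totals, factor=2):
    """Coarsen an equal-width histogram by summing each run of `factor` neighbouring bins."""
    merged = []
    for start in range(0, len(totals), factor):
        merged.append(sum(totals[start:start + factor]))
    return merged
```

The merged histogram has a bin width of factor times the original. If the number of bins is not a multiple of factor, the last merged bin simply sums whatever bins remain.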
