is there a known algorithm + data-structure to maintain a dynamical histogram?
Imagine I have a stream of data (x_1, w_1) , (x_2, w_2), ... where the x_t are doubles, that represent some measured variable and w_t is the associated weight.
I could just do the obvious (pseudo-python code):
x0,xN = 0, 10
numbins = 100
hist = [(x0 + i * delta , 0) for i in xrange(numbins)]
def updateHistogram(x, w):
k = lookup(x, hist) #find the adequated bin where to put x
hist[k][1] += 1
But I have some problems with that when I have a continuous stream of data. I don't have the full dataset in hands, and I have to check up the histogram in between the data gathering. And I have no expectation about:
- the ideal bin sizes for not ending up with a lot of empty bins,
- the range of the data
So I'd like to define the bins dynamically. I could do the stupid thing:
for x in data_stream:
data.append(x)
hist = make_histogram(data)
but I guess this will get slow very quickly...
If the all weights where equal one of the things I thought was storing the data in a sorted array and inserting new data in a way that kept the array sorted. This way I could have:
data = sortedarray();
for x in data_stream:
data.insert(x)
bins = [ data[int(i * data.size()/numbins)] for i in xrange(numbins)]
and the count inside each bin would be equal to data.size()/numbins for all bins.
I can't think of a way of including the weights in this though... does anyone have a suggestion? (knowledge about c++ libraries that do this would be welcomed also).
EDIT: (for the asked clarification)
The x_t are floating point numbers. To calculate the histogram I must divide the continuous range where the x's belong in a number of bins. So I'll have a sequence of numbers bin[0], bin[1], etc... so I must determine for what i does bin[i] < x < bin[i+1].
This is how you usually do a histogram when you have all the data beforehand. You'd then know the limits max(x) and min(x) and it would be easy to determine adequate bins. You could have them equally spaced between min(x) and max(x), for example.
If you don't know the range beforehand, you can't determine the bins. You could receive an x that doesn't fall in any bin. Or you could many empty bins cause you chose too big a range to create the bins.
data[x] += w
? What do you care about besides the weights? – ninjagecko