1 vote

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal counts in each bin, varying the bin width as necessary to keep the counts even. In the literature these are sometimes called equi-height or equi-depth histograms.

I'm not looking to do the usual word-count bar chart. I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins): one might hold SAM-SOLD, the next SOLE-STE, while all of Y-ZZZ fits in a single drawer. I want to calculate where to put the cutoffs for each bin.

Is there (A) a known algorithm for this, similar to approximate histograms for numeric values, or (B) a way to encode the strings so that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.

The best way I can think of so far is to wait until I have some reasonable amount of data, then form logical bins by:

number_of_strings / bin_count = number_of_strings_in_each_bin

Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.

This has two weaknesses for my use case. First, it requires two iterations over a potentially very large number of strings: one to count them, and one to find the endpoints. More importantly, a good histogram implementation can estimate where within a bin a value falls, and that would be really useful.
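The two-pass approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code; the `Iterable` standing in for the key-value store's iterator and the names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPassBins {
    // Pass 1: count the strings. Pass 2: record the first key of every new
    // bin (every (count / binCount)-th element) as a cutoff point.
    public static List<String> cutoffs(Iterable<String> store, int binCount) {
        long count = 0;
        for (String ignored : store) count++;          // first full iteration
        long perBin = Math.max(1, count / binCount);   // strings per bin
        List<String> endpoints = new ArrayList<>();
        long i = 0;
        for (String key : store) {                     // second full iteration
            if (i > 0 && i % perBin == 0 && endpoints.size() < binCount - 1) {
                endpoints.add(key);                    // first key of a new bin
            }
            i++;
        }
        return endpoints;
    }
}
```

For six keys "a".."f" and three bins, this yields the cutoffs "c" and "e" (two endpoints separating three bins of two keys each).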

Thanks.

So your data is sorted to begin with? And what type of data structure is it in? Array? Linked List? Sorted Set? - k_g
it's in a persistent key value store. The only way to access it is to get an iterator and pass over the elements one at a time - L. Blanc
Do you know the size of the key value store before hand? - k_g
no. it will take one iteration to count them all - L. Blanc
Hmm. This is "x = y*z" and you don't know any of the three variables in advance. The only other option, then, is to pick a fixed (maximum) size for your bins and have lots of them; in that case you won't know in advance how many bins it will take. - Jongware

2 Answers

1 vote

If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.

This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
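The estimate-within-a-bin idea can be sketched roughly as follows: binary-search the sorted cutoff array to find a value's bin, then treat the value as sitting near the middle of that bin. The names are hypothetical; the `endpoints` array would come from the sizing pass.

```java
import java.util.Arrays;

public class BinEstimate {
    // endpoints holds the first key of bins 1..n-1 (bin 0 has no lower cutoff).
    // Returns a rough estimate of the fraction of the population sorting
    // before `key`, assuming equal-count bins.
    public static double estimateFraction(String[] endpoints, String key) {
        int pos = Arrays.binarySearch(endpoints, key);
        int bin = (pos >= 0) ? pos + 1 : -(pos + 1);   // bin index in [0, n-1]
        int bins = endpoints.length + 1;
        return (bin + 0.5) / bins;                     // middle of that bin
    }
}
```

With cutoffs {"c", "e"} (three bins), a key such as "d" lands in the middle bin and is estimated at the 50th percentile.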

Of course, if you can make some assumptions about the data, here are some that might help:

For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n, so that even the largest possible population fills no more than n bins.

Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping those into an array, where m is something reasonable based on context.

Then, to find the bin endpoints, you'd take every (size/n/m)-th element of that array, where size is the total count and n the number of bins.
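A minimal sketch of that one-pass sampling idea (the names and the `Iterable` source are assumptions made for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class EveryMthSample {
    // One pass over the stream, keeping every m-th element. The k-th bin
    // endpoint (for n bins) is then the sample element at index
    // k * sampleSize / n.
    public static List<String> cutoffs(Iterable<String> stream, int m, int n) {
        List<String> sample = new ArrayList<>();
        long i = 0;
        for (String s : stream) {
            if (i % m == 0) sample.add(s);
            i++;
        }
        List<String> endpoints = new ArrayList<>();
        for (int k = 1; k < n; k++) {
            endpoints.add(sample.get(k * sample.size() / n));
        }
        return endpoints;
    }
}
```

Note the bins are only approximately equal-sized: the final stretch of the stream after the last sampled element is invisible to the cutoff calculation.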

0 votes

The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size. See Wikipedia for more details. It provides a uniform random sample regardless of whether the stream is ordered.
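The classic form of reservoir sampling (Algorithm R) can be sketched as follows; this is a generic illustration rather than the answerer's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Reservoir {
    // Algorithm R: uniform random sample of k elements from a stream of
    // unknown length, in one pass and O(k) memory.
    public static List<String> sample(Iterable<String> stream, int k, Random rng) {
        List<String> reservoir = new ArrayList<>(k);
        long i = 0;
        for (String s : stream) {
            if (i < k) {
                reservoir.add(s);                 // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * (i + 1)); // uniform in [0, i]
                if (j < k) reservoir.set((int) j, s);         // replace with prob k/(i+1)
            }
            i++;
        }
        return reservoir;
    }
}
```

Each element of the stream ends up in the reservoir with probability k/N, where N is the (unknown) stream length.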

We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.

For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges so that each range holds an equal number of sample elements (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall between two values: if there are 100 equal-sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population lies between them.
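A minimal sketch of that bin-lookup step, using a `TreeMap` keyed by each bin's lower endpoint in place of Guava's `RangeMap` (so the example needs only the JDK); the class and method names are assumptions, and the sample is assumed to be sorted:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SampleHistogram {
    private final TreeMap<String, Integer> bins = new TreeMap<>(); // lower endpoint -> bin number
    private final int binCount;

    // Build equal-count bins from a sorted sample: each bin's lower endpoint
    // is the sample element at index b * (sampleSize / binCount).
    public SampleHistogram(List<String> sortedSample, int binCount) {
        this.binCount = binCount;
        int perBin = sortedSample.size() / binCount;
        for (int b = 0; b < binCount; b++) {
            bins.put(sortedSample.get(b * perBin), b);
        }
    }

    // Bin holding `key`: the entry with the greatest lower endpoint <= key.
    public int binOf(String key) {
        Map.Entry<String, Integer> e = bins.floorEntry(key);
        return (e == null) ? 0 : e.getValue();
    }

    // Estimated fraction of the population lying between two values,
    // assuming each bin covers 1/binCount of the population.
    public double fractionBetween(String lo, String hi) {
        return Math.abs(binOf(hi) - binOf(lo)) / (double) binCount;
    }
}
```

For example, with a sorted 8-element sample and 4 bins, two values landing in bin 0 and bin 3 give an estimate of 3/4 of the population between them.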

This approach has the advantage of working for any Comparable data type.