1 vote

I have a large list (or stream) of UTF-8 strings sorted lexicographically. I would like to create a histogram with approximately equal counts in each bin, varying the bin width as necessary to keep the counts even. In the literature these are sometimes called equi-height or equi-depth histograms.

I'm not looking to do the usual word-count bar chart. I'm looking for something more like an old-fashioned library card catalog, where you have a set of drawers (bins): one might hold SAM-SOLD, the next SOLE-STE, while all of Y-ZZZ fits in a single drawer. I want to calculate where to put the cutoffs for each bin.

Is there (A) a known algorithm for this, similar to approximate histograms for numeric values, or (B) a way to encode the strings so that a standard numeric histogram algorithm would work? The algorithm should not require prior knowledge of the string population.

The best way I can think of so far is to wait until I have some reasonable amount of data, then form logical bins by:

number_of_strings / bin_count = number_of_strings_in_each_bin

Then, starting at 0, step forward by number_of_strings_in_each_bin to get the bin endpoints.

This has two weaknesses for my use case. First, it requires two iterations over a potentially very large number of strings: one to count them, and one to find the endpoints. More importantly, a good histogram implementation can estimate where within a bin a value falls, and that would be really useful.
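The two-pass approach described above can be sketched as follows. This is a minimal illustration, not the asker's actual code; the `Iterable` standing in for the key-value store's iterator and the names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

public class TwoPassBins {
    // Pass 1: count the strings. Pass 2: record the first key of every new
    // bin (every (count / binCount)-th element) as a cutoff point.
    public static List<String> cutoffs(Iterable<String> store, int binCount) {
        long count = 0;
        for (String ignored : store) count++;          // first full iteration
        long perBin = Math.max(1, count / binCount);   // strings per bin
        List<String> endpoints = new ArrayList<>();
        long i = 0;
        for (String key : store) {                     // second full iteration
            if (i > 0 && i % perBin == 0 && endpoints.size() < binCount - 1) {
                endpoints.add(key);                    // first key of a new bin
            }
            i++;
        }
        return endpoints;
    }
}
```

For six keys "a".."f" and three bins, this yields the cutoffs "c" and "e" (two endpoints separating three bins of two keys each).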

Thanks.

So your data is sorted to begin with? And what type of data structure is it in? Array? Linked List? Sorted Set? - k_g
it's in a persistent key value store. The only way to access it is to get an iterator and pass over the elements one at a time - L. Blanc
Do you know the size of the key value store before hand? - k_g
no. it will take one iteration to count them all - L. Blanc
Hmm. This is "x = y*z" and you don't know any of the three variables in advance. The only other option, then, is to pick a fixed (maximum) size for your bins and have lots of them; in that case you won't know in advance how many bins it will take. - Jongware

2 Answers

1 vote

If we can't make any assumptions about the data, you are going to have to make a pass to determine bin size.

This means that you have to either start with a bin size rather than bin number or live with a two-pass model. I'd just use linear interpolation to estimate positions between bins, then do a binary search from there.
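The estimate-within-a-bin idea can be sketched roughly as follows: binary-search the sorted cutoff array to find a value's bin, then treat the value as sitting near the middle of that bin. The names are hypothetical; the `endpoints` array would come from the sizing pass.

```java
import java.util.Arrays;

public class BinEstimate {
    // endpoints holds the first key of bins 1..n-1 (bin 0 has no lower cutoff).
    // Returns a rough estimate of the fraction of the population sorting
    // before `key`, assuming equal-count bins.
    public static double estimateFraction(String[] endpoints, String key) {
        int pos = Arrays.binarySearch(endpoints, key);
        int bin = (pos >= 0) ? pos + 1 : -(pos + 1);   // bin index in [0, n-1]
        int bins = endpoints.length + 1;
        return (bin + 0.5) / bins;                     // middle of that bin
    }
}
```

With cutoffs {"c", "e"} (three bins), a key such as "d" lands in the middle bin and is estimated at the 50th percentile.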

Of course, if you can make some assumptions about the data, here are some that might help:

For example, you might not know the exact size, but you might know that it will fall in some interval [a, b]. If you want at most n bins, make the bin size b/n, so that even the largest possible population fills no more than n bins.

Alternatively, if you're not particular about exactly equal-sized bins, you could do it in one pass by sampling every m-th element and dumping those into an array, where m is something reasonable based on context.

Then, to find the bin endpoints, you'd take every (size/n/m)-th element of that array, where size is the total count and n the number of bins.
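A minimal sketch of that one-pass sampling idea (the names and the `Iterable` source are assumptions made for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class EveryMthSample {
    // One pass over the stream, keeping every m-th element. The k-th bin
    // endpoint (for n bins) is then the sample element at index
    // k * sampleSize / n.
    public static List<String> cutoffs(Iterable<String> stream, int m, int n) {
        List<String> sample = new ArrayList<>();
        long i = 0;
        for (String s : stream) {
            if (i % m == 0) sample.add(s);
            i++;
        }
        List<String> endpoints = new ArrayList<>();
        for (int k = 1; k < n; k++) {
            endpoints.add(sample.get(k * sample.size() / n));
        }
        return endpoints;
    }
}
```

Note the bins are only approximately equal-sized: the final stretch of the stream after the last sampled element is invisible to the cutoff calculation.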

0 votes

The solution I came up with addresses the lack of up-front information about the population by using reservoir sampling. Reservoir sampling lets you efficiently take a random sample of a given size from a population of unknown size. See Wikipedia for more details. It provides a uniform random sample regardless of whether the stream is ordered.
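The classic form of reservoir sampling (Algorithm R) can be sketched as follows; this is a generic illustration rather than the answerer's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Reservoir {
    // Algorithm R: uniform random sample of k elements from a stream of
    // unknown length, in one pass and O(k) memory.
    public static List<String> sample(Iterable<String> stream, int k, Random rng) {
        List<String> reservoir = new ArrayList<>(k);
        long i = 0;
        for (String s : stream) {
            if (i < k) {
                reservoir.add(s);                 // fill the reservoir first
            } else {
                long j = (long) (rng.nextDouble() * (i + 1)); // uniform in [0, i]
                if (j < k) reservoir.set((int) j, s);         // replace with prob k/(i+1)
            }
            i++;
        }
        return reservoir;
    }
}
```

Each element of the stream ends up in the reservoir with probability k/N, where N is the (unknown) stream length.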

We make one pass through the data, gathering a sample. For the sample we have explicit information about the number of elements as well as their distribution.

For the histogram, I used a Guava RangeMap. I picked the endpoints of the ranges so that each range holds an equal number of sample elements (sample_size / number_of_bins). The Integer in the map merely stores the order of the ranges, from 1 to n. This allows me to estimate the proportion of records that fall between two values: if there are 100 equal-sized bins, and the values fall in bin 25 and bin 75, then I can estimate that approximately 50% of the population lies between them.
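A minimal sketch of that bin-lookup step, using a `TreeMap` keyed by each bin's lower endpoint in place of Guava's `RangeMap` (so the example needs only the JDK); the class and method names are assumptions, and the sample is assumed to be sorted:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SampleHistogram {
    private final TreeMap<String, Integer> bins = new TreeMap<>(); // lower endpoint -> bin number
    private final int binCount;

    // Build equal-count bins from a sorted sample: each bin's lower endpoint
    // is the sample element at index b * (sampleSize / binCount).
    public SampleHistogram(List<String> sortedSample, int binCount) {
        this.binCount = binCount;
        int perBin = sortedSample.size() / binCount;
        for (int b = 0; b < binCount; b++) {
            bins.put(sortedSample.get(b * perBin), b);
        }
    }

    // Bin holding `key`: the entry with the greatest lower endpoint <= key.
    public int binOf(String key) {
        Map.Entry<String, Integer> e = bins.floorEntry(key);
        return (e == null) ? 0 : e.getValue();
    }

    // Estimated fraction of the population lying between two values,
    // assuming each bin covers 1/binCount of the population.
    public double fractionBetween(String lo, String hi) {
        return Math.abs(binOf(hi) - binOf(lo)) / (double) binCount;
    }
}
```

For example, with a sorted 8-element sample and 4 bins, two values landing in bin 0 and bin 3 give an estimate of 3/4 of the population between them.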

This approach has the advantage of working for any Comparable data type.