I am sending data to PubSub, and from there I am trying to create a DataFlow job that writes the data into BigQuery. The data has a column of unique identifiers on which I want to run HLL_COUNT.INIT. Is there an equivalent method on the DataFlow side, so that I can store the HLL sketch of the column directly in BigQuery?
1 Answer
No, DataFlow doesn't support the BigQuery HLL sketch format, though it would clearly be useful. I created a feature request for it in the DataFlow issue tracker: https://issuetracker.google.com/62153424.
Update: A BigQuery-compatible implementation of HyperLogLog++ has been open-sourced at github.com/google/zetasketch, and a design doc (docs.google.com/document/d/…) about integrating it into Apache Beam has been sent out to the Apache Beam mailing list.
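For anyone unsure what "storing the HLL version of the column" means, here is a rough, self-contained sketch of the idea in plain Python. It is a toy HyperLogLog, not HyperLogLog++, and its registers are not compatible with BigQuery's sketch format or with zetasketch. The names `hll_init`, `hll_merge` and `hll_estimate` are made up for this illustration, and the precision of 12 bits is an arbitrary choice. They only loosely mirror the roles of HLL_COUNT.INIT (build a sketch), HLL_COUNT.MERGE_PARTIAL (combine sketches) and HLL_COUNT.EXTRACT (read the estimate).

```python
import hashlib
import math

P = 12          # precision bits (arbitrary choice for this illustration)
M = 1 << P      # number of registers in the sketch


def _hash64(value: str) -> int:
    """Stable 64-bit hash of a string (first 8 bytes of SHA-256)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_init(values):
    """Build a sketch (a list of M small integers) from an iterable of strings."""
    registers = [0] * M
    for value in values:
        h = _hash64(value)
        index = h >> (64 - P)                    # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
        # Position of the leftmost 1-bit in `rest`, counted from 1.
        rank = (64 - P) - rest.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank
    return registers


def hll_merge(a, b):
    """Combine two sketches; the result equals a sketch of the union."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    """Approximate number of distinct values represented by a sketch."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        # Small-range correction (linear counting).
        return round(M * math.log(M / zeros))
    return round(raw)


if __name__ == "__main__":
    # Two overlapping batches, e.g. two windows of a streaming job.
    batch_a = hll_init(f"user-{i}" for i in range(10_000))
    batch_b = hll_init(f"user-{i}" for i in range(5_000, 15_000))
    print(hll_estimate(hll_merge(batch_a, batch_b)))
```

The useful property is that sketches merge without double counting: each batch or window can store its own small sketch, and the distinct count across any combination of them can be estimated later. To get sketches that BigQuery's HLL_COUNT functions can actually read, you would need a BigQuery-compatible implementation such as the zetasketch library mentioned above, not this toy.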