0 votes

I want to use the same function for parsing events in two different technologies: Google BigQuery and Dataflow. Is there a language I can do this in? If not, is Google planning to support one any time soon?

Background: Some of this parsing is complex (e.g., applying custom URL extraction rules, extracting information out of the user agent), but it's not computationally expensive and doesn't involve joining the events to any large look-up tables. Because the parsing can be complex, I want to write my parsing logic in only one language and run it wherever I need it: sometimes in BigQuery, sometimes in other environments like Dataflow. I want to avoid writing the same complex parsers/extractors in different languages because of the bugs and inconsistencies that can result from that.

I know BigQuery supports JavaScript UDFs. Is there a clean way to run JavaScript on Google Cloud Dataflow? Will BigQuery someday support UDFs in some other language?

1
What is the format of these events? JSON? Strings following some kind of pattern? It's not clear why the logic for parsing can only exist in one language. - Elliott Brossard
Good point @ElliottBrossard, I've added some additional context to the question to clear up my motivation. Let's assume the events are in the form of NGINX webserver logs, but I don't think it's too important (I'd be facing similar requirements for any format really because I'm going beyond basic parsing and applying extraction rules to enrich the events). - conradlee
I don't think it necessarily has to be complex and hard to maintain, though. Set up inputs and expected outputs for various kinds of extraction in some kind of language agnostic way, e.g. text files, and then you can ensure compatibility even if the extraction is implemented in multiple languages. Depending on what you are doing, you may be able to use RE2 (regexes) for extraction. You can use RE2 both from Java and from SQL in BigQuery. - Elliott Brossard
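To illustrate the comment's suggestion, here is a minimal sketch of keeping one RE2-compatible pattern (no backreferences or lookarounds) usable from both Java and BigQuery SQL. The class name, method, and the NGINX-style pattern are all hypothetical examples, and the SQL equivalent shown in the comment is an assumption about how you would mirror it with `REGEXP_EXTRACT`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: extract the request path from an NGINX access-log line.
// The pattern avoids backreferences/lookarounds, so the same pattern string
// should also work in BigQuery as an RE2 regex, e.g.:
//   REGEXP_EXTRACT(line, r'"[A-Z]+ ([^ ?"]+)')
public class PathExtractor {
    private static final Pattern REQUEST_PATH =
        Pattern.compile("\"[A-Z]+ ([^ ?\"]+)");

    public static String extractPath(String logLine) {
        Matcher m = REQUEST_PATH.matcher(logLine);
        return m.find() ? m.group(1) : null;
    }
}
```

A language-agnostic fixture file of input lines and expected extractions, as the comment suggests, could then drive both a JUnit test of this class and a test query in BigQuery, catching drift between the two implementations.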

1 Answer

0 votes

We tend to use Java both to orchestrate BigQuery jobs and to parse their resulting data, and we do the same in Dataflow.

Likewise, you have leeway in how much SQL you write by hand versus auto-generate from the code base, and in how much work you push to BigQuery versus Dataflow. (With our larger volumes of data, we have found a big benefit in offloading as much of the initial grouping/filtering as possible into BigQuery before pulling the data into Dataflow.)
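The approach above can be sketched as keeping the parsing logic in a plain, framework-free Java class, so the same code can be called from a loop over BigQuery results or wrapped in a Dataflow `DoFn`. This is an illustrative sketch, not the answerer's actual code; the class, method, and key=value event format are all hypothetical:

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a dependency-free parser that can be unit-tested once
// and reused anywhere. In a Dataflow (Apache Beam) pipeline, a DoFn's
// processElement method would simply delegate to EventParser.parse().
public class EventParser implements Serializable {
    // Parse "key=value" pairs separated by '&' into a map,
    // skipping malformed pairs that have no '=' or an empty key.
    public static Map<String, String> parse(String event) {
        Map<String, String> fields = new HashMap<>();
        for (String pair : event.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                fields.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return fields;
    }
}
```

Because the class has no Beam or BigQuery dependencies, the complex extraction rules live in one place and get one test suite, regardless of which environment ultimately runs them.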