0
votes

Background:

I have messages published to a Pub/Sub topic, and I'm trying to stream them into BigQuery using Dataflow, with the "Cloud Pub/Sub Topic to BigQuery" template. (Read more about streaming templates.)

Issue:

Some field names in the Pub/Sub messages contain dashes (-), which I have no control over, and the Dataflow template expects the destination BigQuery table to have exactly the same field names. One limitation of BigQuery is that dashes (-) aren't allowed in field names, although underscores are. As a result, I'm getting errors like the following, even though I have a params.gsm_level column in BigQuery:

{
  "errors" : [ {
    "debugInfo" : "",
    "location" : "params.gsm-level",
    "message" : "no such field.",
    "reason" : "invalid"
  } ],
  "index" : 0
}

Question:

Is there anything I can do to map gsm-level in the message to gsm_level in BigQuery, while still using the Dataflow template and without writing my own job? Thanks!

1
fyi: you should post it as an answer - Mikhail Berlyant

1 Answer

1
votes

I was able to use a JavaScript UDF with a helper function to "clean up" the key names:

// Recursively replaces dashes with underscores in all object keys.
function _transform(value) {
    var objectConstructor = {}.constructor;
    var arrayConstructor = [].constructor;

    // Guard against null/undefined, which have no constructor
    // (JSON null values would otherwise throw a TypeError below).
    if (value === null || value === undefined) {
        return value;
    }

    if (value.constructor === objectConstructor) {
        var out = {};
        for (var key in value) {
            if (value.hasOwnProperty(key)) {
                // Replace every dash in the key with an underscore.
                var cleaned_key = key.replace(/-/g, '_');
                out[cleaned_key] = _transform(value[key]);
            }
        }
        return out;
    } else if (value.constructor === arrayConstructor) {
        // Clean each array element recursively.
        var cleaned_arr = [];
        value.forEach(function(item) {
            cleaned_arr.push(_transform(item));
        });
        return cleaned_arr;
    } else {
        // Primitive value: nothing to clean.
        return value;
    }
}
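For completeness, here is a sketch of the entry point the template would invoke. The function name process is an assumption; you pass whatever name you choose via the template's javascriptTextTransformFunctionName parameter, and the template hands the UDF each message as a JSON string and expects a JSON string back:

```javascript
// Sketch of a UDF entry point for the Pub/Sub Topic to BigQuery
// template. The name "process" is an assumption -- point the
// javascriptTextTransformFunctionName template parameter at it.
function process(inJson) {
    // Parse the message, clean the keys, and re-serialize.
    return JSON.stringify(_transform(JSON.parse(inJson)));
}

// Recursive key cleaner (same idea as the helper above, repeated
// here so this sketch is self-contained).
function _transform(value) {
    if (value === null || typeof value !== 'object') {
        return value; // primitives and null pass through unchanged
    }
    if (Array.isArray(value)) {
        return value.map(_transform); // clean each element
    }
    var out = {};
    for (var key in value) {
        if (value.hasOwnProperty(key)) {
            out[key.replace(/-/g, '_')] = _transform(value[key]);
        }
    }
    return out;
}
```

Upload the file to Cloud Storage and point the template's javascriptTextTransformGcsPath parameter at it.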

There might be a cleaner way, but so far this works.