
General question, if anyone can point me in the right direction: what is the best way to get incoming streaming .csv files into BigQuery (with some transformations applied using Dataflow) at large scale, using Pub/Sub? I'm thinking of using Pub/Sub to handle the many large raw streams of incoming .csv files.

For example, the approach I'm thinking of is:

1. incoming raw .csv file > 2. Pub/Sub > 3. Cloud Storage > 4. Cloud Function (to trigger Dataflow) > 5. Dataflow (to transform) > 6. BigQuery

Let me know if there are any issues with this approach at scale, or whether there is a better alternative.

If that is a good approach, how do I get Pub/Sub to pick up the .csv files, and how do I construct this?

Thanks

Ben


I'm a bit confused: are the files already coming over Pub/Sub, or are you getting them and then wanting to dump them to Pub/Sub? If the latter, what are you intending to get out of adding Pub/Sub to the architecture rather than having Dataflow process the files directly on GCS? – Ryan McDowell
The first: to try and get the files coming over Pub/Sub. How do I create a message so that Pub/Sub can receive the .csv files, with Pub/Sub being the entry point before going into GCS? My reason for adding Pub/Sub to the architecture was to handle the many incoming files from the internet as a stream, unless there is a better way to handle this? Also, I missed a step. – BenAhm
You can store your .csv files on Google Cloud Storage and push them to Pub/Sub (for example, line by line...). Dataflow can subscribe to Pub/Sub topics, so you don't have to store them on GCS again. The result may be written to another Pub/Sub topic, and once it is written a Cloud Function is triggered and sends the result to BigQuery. – Rim

1 Answer


There are a couple of different ways to approach this, but much of your use case can be solved using the Google-provided Dataflow templates. When using the templates, light transformations can be done within a JavaScript UDF. This saves you from maintaining an entire pipeline and only requires writing the transformations necessary for your incoming data.

If you're accepting many files as a stream into Cloud Pub/Sub, keep in mind that Cloud Pub/Sub makes no guarantees about ordering, so records from different files would likely get intermixed in the output. If you're looking to capture an entire file as-is, uploading directly to GCS would be the better approach.

Using the provided templates, either Cloud Pub/Sub to BigQuery or GCS Text to BigQuery, you can use a simple UDF to transform the data from CSV format into JSON matching the BigQuery output table schema.
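
For the streaming path, launching the Cloud Pub/Sub to BigQuery template with a UDF attached looks roughly like the sketch below. I'm writing the template location and parameter names from memory, so verify them against the current template documentation; the project, topic, table, and bucket values are placeholders, and transform.js refers to the UDF shown further down.

gcloud dataflow jobs run csv-to-bq-streaming \
    --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
    --parameters \
inputTopic=projects/YOUR_PROJECT/topics/YOUR_TOPIC,\
outputTableSpec=YOUR_PROJECT:YOUR_DATASET.YOUR_TABLE,\
javascriptTextTransformGcsPath=gs://YOUR_BUCKET/udf/transform.js,\
javascriptTextTransformFunctionName=transform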

For example, if you had CSV records such as:

transactionDate,product,retailPrice,cost,paymentType
2018-01-08,Product1,99.99,79.99,Visa

You could write a UDF to transform that data into your output schema like so:

function transform(line) {
  var values = line.split(',');

  // Construct the output object and add transformations
  var obj = new Object();
  obj.transactionDate = values[0];
  obj.product = values[1];
  // Parse the numeric fields so they are emitted as numbers, not strings
  obj.retailPrice = parseFloat(values[2]);
  obj.cost = parseFloat(values[3]);
  obj.marginPct = (obj.retailPrice - obj.cost) / obj.retailPrice;
  obj.paymentType = values[4];

  // The template expects each record as a JSON string matching the output table schema
  return JSON.stringify(obj);
}
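
If you instead land whole files in GCS and use the GCS Text to BigQuery template, the template also expects a JSON file in GCS describing the output table schema. As a rough sketch (the "BigQuery Schema" file format and the parameter names are as I recall them from the template documentation, and the bucket, dataset, and file names are placeholders):

{
  "BigQuery Schema": [
    {"name": "transactionDate", "type": "DATE"},
    {"name": "product", "type": "STRING"},
    {"name": "retailPrice", "type": "FLOAT"},
    {"name": "cost", "type": "FLOAT"},
    {"name": "marginPct", "type": "FLOAT"},
    {"name": "paymentType", "type": "STRING"}
  ]
}

gcloud dataflow jobs run csv-to-bq-batch \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --parameters \
javascriptTextTransformGcsPath=gs://YOUR_BUCKET/udf/transform.js,\
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://YOUR_BUCKET/schema/transactions-schema.json,\
inputFilePattern=gs://YOUR_BUCKET/input/*.csv,\
outputTable=YOUR_PROJECT:YOUR_DATASET.transactions,\
bigQueryLoadingTemporaryDirectory=gs://YOUR_BUCKET/tmp

Upload the UDF and the schema file to GCS before launching; the template reads both when the job starts.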