
I am planning to build a serverless data pipeline on Google Cloud Platform. My plan is to use Dataflow or Dataproc for batch processing data from three different sources.

My input sources are:

  1. Cloud SQL (MySQL)
  2. Cloud SQL (PostgreSQL)
  3. MongoDB

But after reading their documentation, I found that they don't have any built-in input connector for Cloud SQL or MongoDB.

I have also checked the custom driver section, but it only covers Java, and I am planning to use Python.

Any idea how I can ingest those three sources with Dataflow/Dataproc?


1 Answer


In your situation, I think the best option is Dataproc, since your workload is batch processing.

This way you can use Hadoop or Spark and have more control over the workflow.

You can use Python code with Spark. {1}
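For example, a minimal PySpark batch job you could submit to a Dataproc cluster might look like the sketch below. The bucket name, file path, and column names are placeholders, not anything from your setup:

```python
# Minimal PySpark batch job sketch for Dataproc.
# The gs:// paths and the "amount" column are placeholders.

def categorize_amount(amount):
    """Pure-Python helper: bucket an order amount into 'large' or 'small'."""
    return "large" if amount >= 100 else "small"

def main():
    # Spark imports live inside main() so the helper above can be
    # imported and unit-tested without a Spark installation.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.appName("batch-example").getOrCreate()

    # Read a CSV file staged in Cloud Storage (path is an assumption).
    df = spark.read.csv("gs://my-bucket/orders.csv",
                        header=True, inferSchema=True)

    # Apply the Python helper as a UDF and write the result back out.
    bucket = udf(categorize_amount, StringType())
    df = df.withColumn("size", bucket(df["amount"]))
    df.write.parquet("gs://my-bucket/output/orders")

    spark.stop()

if __name__ == "__main__":
    main()
```

You would submit this with `gcloud dataproc jobs submit pyspark` (or plain `spark-submit` on the cluster).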

You can do SQL queries with Spark. {2}
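Since both of your Cloud SQL instances speak standard MySQL/PostgreSQL, one common approach (an assumption on my part, not something specific to Cloud SQL) is Spark's generic JDBC data source. The hosts, ports, database names, and credentials below are placeholders, and you need the matching JDBC driver jar available to the cluster (e.g. via `--jars` or `--packages` on `spark-submit`):

```python
# Sketch: ingesting Cloud SQL (MySQL and PostgreSQL) into Spark via JDBC.
# All connection details below are placeholders.

def jdbc_url(engine, host, port, database):
    """Build a JDBC URL for the given engine ('mysql' or 'postgresql')."""
    return "jdbc:{}://{}:{}/{}".format(engine, host, port, database)

def main():
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("cloudsql-ingest").getOrCreate()

    # Cloud SQL (MySQL) table -> DataFrame.
    mysql_df = (spark.read.format("jdbc")
                .option("url", jdbc_url("mysql", "10.0.0.5", 3306, "sales"))
                .option("dbtable", "orders")
                .option("user", "report")
                .option("password", "secret")
                .load())

    # Cloud SQL (PostgreSQL) table -> DataFrame.
    pg_df = (spark.read.format("jdbc")
             .option("url", jdbc_url("postgresql", "10.0.0.6", 5432, "crm"))
             .option("dbtable", "customers")
             .option("user", "report")
             .option("password", "secret")
             .load())

    # Once loaded, you can query both with Spark SQL.
    mysql_df.createOrReplaceTempView("orders")
    pg_df.createOrReplaceTempView("customers")
    spark.sql("SELECT COUNT(*) FROM orders").show()

    spark.stop()

if __name__ == "__main__":
    main()
```

Note that Dataproc workers must be able to reach the Cloud SQL instances over the network (e.g. same VPC, or the Cloud SQL Proxy).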

There is also a connector for MongoDB and Spark. {3}
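With that connector, a read might look like the sketch below. The host, database, and collection names are placeholders, and the exact package coordinates and option names should be checked against the connector docs linked in {3}:

```python
# Sketch: reading a MongoDB collection into Spark with the MongoDB Spark
# connector. Submit with the connector package, e.g.:
#   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 job.py
# Connection details below are placeholders.

def mongo_uri(host, database, collection):
    """Build a MongoDB connection URI pointing at a single collection."""
    return "mongodb://{}/{}.{}".format(host, database, collection)

def main():
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("mongo-ingest")
             .config("spark.mongodb.input.uri",
                     mongo_uri("10.0.0.7:27017", "app", "events"))
             .getOrCreate())

    # The connector infers a DataFrame schema from the documents.
    events = spark.read.format("mongo").load()
    events.printSchema()

    spark.stop()

if __name__ == "__main__":
    main()
```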

And a connector for MongoDB and Hadoop. {4}

{1}: https://spark.apache.org/docs/0.9.0/python-programming-guide.html

{2}: https://spark.apache.org/docs/latest/sql-programming-guide.html

{3}: https://docs.mongodb.com/spark-connector/master/

{4}: https://docs.mongodb.com/ecosystem/tools/hadoop/