
We need to move data from Hive tables (Hadoop) to BigQuery on GCP (Google Cloud Platform) at regular intervals (hourly/daily/other). There are multiple tables and the volume of data is huge. Can you please let me know if Cloud Dataflow can be used in this situation? Are there any alternatives?

Thanks in advance!

Regards, Kumar

1
Where is your Hadoop cluster(s) installed and running, i.e. on-premise or on GCP? Depending on your situation, we can help you build your strategy. – Raunak Jhawar
Thanks for your response! The Hadoop cluster is not on GCP; it is on-premise. – kishore k
There are umpteen solutions depending on the timescale you have, but any of them will involve using gsutil to copy data from HDFS (via the local file system) and then to GCS. Alternatively, you may develop a solution using MySQL backups and restoring them on GCP. – Raunak Jhawar
Thanks Raunak! Can we use Cloud Dataflow to read tables from Hive and store them in Google Cloud Storage? – kishore k
No. This cannot be done unless the HDFS namespace is extended and visible to GCP. The best approach here is to copy the HDFS data across environments. – Raunak Jhawar

1 Answer


There are umpteen solutions depending on the timescale you have, but any of them will involve using gsutil to copy data from HDFS (via the local file system) and then to GCS. Alternatively, you may develop a solution using MySQL backups and restoring them on GCP.
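
To make the gsutil approach concrete, here is a minimal Python sketch of that copy pipeline. It assumes the machine running it can reach the cluster and has the hdfs, gsutil, and bq CLIs installed and authenticated; every path, bucket, dataset, and table name below is a hypothetical placeholder, and the final load step assumes a non-partitioned table stored in a format BigQuery can ingest directly (e.g. Parquet or Avro).

    import os
    import subprocess

    # All names below are hypothetical placeholders -- substitute your own
    # HDFS path, staging directory, bucket, and BigQuery dataset/table.
    HDFS_TABLE_PATH = "/user/hive/warehouse/my_db.db/my_table"
    LOCAL_STAGING_DIR = "/tmp/hive_export/my_table"
    GCS_DESTINATION = "gs://my-bucket/hive_export/my_table"
    BQ_TABLE = "my_dataset.my_table"

    def run(cmd):
        # Fail fast if any stage of the pipeline errors out.
        subprocess.run(cmd, check=True)

    os.makedirs(os.path.dirname(LOCAL_STAGING_DIR), exist_ok=True)

    # Step 1: copy the table's files from HDFS to the local file system.
    run(["hdfs", "dfs", "-get", HDFS_TABLE_PATH, LOCAL_STAGING_DIR])

    # Step 2: mirror the staged directory into GCS; -m parallelizes the
    # transfer, which matters for large tables.
    run(["gsutil", "-m", "rsync", "-r", LOCAL_STAGING_DIR, GCS_DESTINATION])

    # Step 3 (assumes the files are in a format BigQuery can load
    # directly, e.g. Parquet): load the GCS objects into BigQuery.
    run(["bq", "load", "--source_format=PARQUET", BQ_TABLE,
         GCS_DESTINATION + "/*"])

Running a script like this under cron or a workflow scheduler would give you the hourly/daily cadence mentioned in the question.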