
I would like to upload some data currently stored in PostgreSQL to Google BigQuery to see how the two tools compare.

There are many options for moving data around, but the most user-friendly one (for me) that I have found so far leverages the power of Python pandas.

import pandas as pd
# `engine` is an SQLAlchemy engine connected to the RDS instance (setup not shown)

sql = "SELECT * FROM {}".format(input_table_name)
i = 0
for chunk in pd.read_sql_query(sql, engine, chunksize=10000):
    print("Chunk number:", i)
    i += 1
    chunk.to_gbq(destination_table="my_new_dataset.test_pandas",
                 project_id="aqueduct30",
                 if_exists="append")

However, this approach is rather slow, and I was wondering what options I have to speed things up. My table has 11 million rows and 100 columns.

The PostgreSQL database is on AWS RDS, and I call Python from an Amazon EC2 instance. Both are large and fast. I am currently not using multiple processors, although 16 are available.

We shift new data (daily partitions) from PG to BigQuery every day, and in most cases we use a program written in Go. But for really huge tables (~100 million rows, approx. 20 GB) we use a bash script that wraps an export from PG into CSV format and loads it into BQ using the bq command-line tool, because the Go program took several hours. The script also covers conversion of the table structure into the data types used on BQ - postgresql.freeideas.cz/… - JosMac
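The comment's CSV-export-then-load approach can be sketched in Python by building the two shell commands. This is a minimal sketch: the table names and CSV path are placeholders, and it assumes `psql` and an authenticated `bq` CLI are installed.

```python
def build_export_cmd(pg_table, csv_path):
    # psql's \copy streams the table out of Postgres as a client-side CSV file
    return ["psql", "-c",
            "\\copy {} TO '{}' WITH CSV HEADER".format(pg_table, csv_path)]

def build_load_cmd(bq_table, csv_path):
    # bq load ingests the CSV into BigQuery; --skip_leading_rows drops the header
    return ["bq", "load", "--source_format=CSV", "--skip_leading_rows=1",
            bq_table, csv_path]
```

Each command list can then be executed with `subprocess.run(cmd, check=True)`, running the export first and the load second.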

1 Answer


As alluded to in JosMac's comment, your solution/approach simply won't scale with large datasets. Since you're already running on AWS/RDS, something like the following would be better in my opinion:

  1. Export Postgres table(s) to S3
  2. Use the GCS transfer service to pull export from S3 into GCS
  3. Load directly into BigQuery from GCS (consider automating this pipeline using Cloud Functions and Dataflow)
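Step 3 corresponds to a BigQuery load job. A minimal sketch of its configuration, shaped like the `configuration.load` section of a `jobs.insert` request, with placeholder project, dataset, table, and bucket names:

```python
def make_load_job_config(project, dataset, table, gcs_uri):
    # Shape of the `configuration.load` section of a BigQuery jobs.insert request
    return {
        "load": {
            "sourceUris": [gcs_uri],            # e.g. a gs:// path in GCS
            "destinationTable": {
                "projectId": project,
                "datasetId": dataset,
                "tableId": table,
            },
            "sourceFormat": "CSV",
            "skipLeadingRows": 1,               # drop the CSV header row
            "writeDisposition": "WRITE_APPEND", # append rather than overwrite
        }
    }
```

In practice you would submit this through the google-cloud-bigquery client library or the bq command-line tool rather than calling the REST API by hand.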