0
votes

I have a large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?

The constraints are:

  1. The table needs to still be readable while the job is running

  2. The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time

Some of the ideas I have:

  1. Load the data that needs to be updated into a pandas DataFrame and run pd.to_sql

  2. Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE

  3. Load the data into memory, break it into chunks, and run a multi-threaded process where each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits (a rough sketch of what I mean follows below)
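For option 3, something like the following is what I have in mind; this is only a sketch, and the table, columns, and connection URL are placeholders:

    # Rough sketch of idea 3: chunked updates through a shared connection pool.
    # Table name, column names, and connection URL are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    import sqlalchemy

    # One engine (and therefore one pool) shared by all worker threads.
    engine = sqlalchemy.create_engine(
        "mysql+pymysql://user:password@10.0.0.1/mydb",  # placeholder Cloud SQL URL
        pool_size=5,        # keep well below the instance's connection limit
        max_overflow=0,     # never open more than pool_size connections
    )

    def update_chunk(rows):
        """Apply one chunk of updates inside a single transaction."""
        stmt = sqlalchemy.text(
            "UPDATE my_table SET value = :value WHERE id = :id"  # placeholder schema
        )
        with engine.begin() as conn:
            conn.execute(stmt, rows)  # executemany over the chunk

    def chunked(items, size):
        for i in range(0, len(items), size):
            yield items[i:i + size]

    rows = [{"id": i, "value": i * 10} for i in range(100_000)]  # data loaded in memory

    with ThreadPoolExecutor(max_workers=5) as pool:
        list(pool.map(update_chunk, chunked(rows, 1_000)))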


3 Answers

1
votes

One approach for your reference, based on a PostgreSQL partitioned table; it does require some DDL to define the partitioned table.

Currently, your main constraints are:

  1. The table needs to still be readable while the job is running

This means no table-level lock is allowed.

  2. The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time

It should be capable of handling multiple writers at the same time.

I would add one thing for you to consider as well:

  3. Reasonable read performance while writing: performance and user experience are key

A partitioned table can meet all of these requirements, and it is transparent to the client application.

At present you are doing ETL, and you will soon face performance issues as the table grows quickly. A partitioned table is the only practical solution here.

The main steps are:

  1. Create a partitioned table with a partition list.

  2. Normal reading and writing to the table continue as usual.

  3. ETL process (can run in parallel; see the sketch after this list):

    - ETL the data and load it into a new table (very slow, minutes to hours, but with no impact on the main table).

    - Attach the new table to the main table's partition list (very fast, effectively instantaneous, which makes the new data visible in the main table).

  4. Normal reading and writing of the main table continue as usual, now with the new data.
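A minimal sketch of steps 1 and 3, assuming PostgreSQL list partitioning on a batch_id column and execution from Python with psycopg2; the table and column names here are only illustrative:

    # Sketch of the partition workflow, assuming PostgreSQL list partitioning
    # on a batch_id column; table/column names are illustrative only.
    import psycopg2

    conn = psycopg2.connect("host=10.0.0.1 dbname=mydb user=etl password=secret")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Step 1: the main table is declared partitioned (done once).
        cur.execute("""
            CREATE TABLE IF NOT EXISTS main_table (
                id        bigint,
                batch_id  int,
                payload   text
            ) PARTITION BY LIST (batch_id);
        """)

        # Step 3a: slow ETL into a standalone staging table; readers of
        # main_table are not affected while this runs.
        cur.execute("CREATE TABLE staging_batch_42 (LIKE main_table INCLUDING ALL);")
        # ... bulk-load staging_batch_42 here (COPY, to_sql, etc.) ...

        # Step 3b: attaching the staging table as a partition is a fast
        # metadata operation, after which the new rows are visible in main_table.
        cur.execute("""
            ALTER TABLE main_table
            ATTACH PARTITION staging_batch_42 FOR VALUES IN (42);
        """)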

If you like the answer, please vote it up.

Best Regards, WY

0
votes

My recent Airflow-related ETL project could be a reference for you.

  • Input DB: large (billion-row-level Oracle)
  • Interim store: medium (tens-of-millions-level HDF5 file)
  • Output DB: medium (tens-of-millions-level MySQL)

In my experience, writing to the database is the main bottleneck in such an ETL process, so:

  • For the interim stage, I use HDF5 as the interim store for data transforming. The pandas to_hdf function provides seconds-level performance on large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes. Below is the performance benchmarking for pandas IO; the HDF5 format is among the top-3 fastest and most popular formats. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf

  • For the output stage, I use to_sql with the chunksize parameter. In order to speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns. Without manual mapping, to_sql maps strings to a BLOB/TEXT type or a generic varchar, and that default mode is about 10 times slower than the manually mapped mode. In total, writing 20 million rows to the database via to_sql (chunksize mode) took about 20 minutes. A sketch of both stages follows below.
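A rough sketch of the interim and output stages, assuming pandas with PyTables installed and a MySQL connection via SQLAlchemy; the file path, table name, columns, and connection URL are placeholders:

    # Sketch of the interim (HDF5) and output (to_sql) stages;
    # file path, table name, columns, and connection URL are placeholders.
    import pandas as pd
    import sqlalchemy
    from sqlalchemy import types

    df = pd.DataFrame({
        "id": range(1_000_000),
        "name": ["example"] * 1_000_000,
        "amount": [1.5] * 1_000_000,
    })

    # Interim stage: dump the transformed frame to HDF5 (fast local staging).
    df.to_hdf("staging.h5", key="batch", mode="w")

    # Output stage: write to MySQL in chunks, with explicit column types so
    # strings become a sized VARCHAR instead of the slower default mapping.
    engine = sqlalchemy.create_engine("mysql+pymysql://user:password@10.0.0.1/mydb")
    df.to_sql(
        "output_table",
        engine,
        if_exists="append",
        index=False,
        chunksize=10_000,
        dtype={
            "id": types.BigInteger(),
            "name": types.VARCHAR(length=64),
            "amount": types.Float(),
        },
    )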

If you like the answer, please vote it up.

0
votes

A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on both the server and the connecting application; one common pattern is to share a single pooled connection source, as sketched below.
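A minimal sketch of that pattern using SQLAlchemy pooling; the connection URL and pool sizes are placeholder assumptions to adapt to your instance's limits:

    # Sketch of connection management with SQLAlchemy pooling;
    # connection URL and pool sizes are placeholders.
    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "mysql+pymysql://user:password@10.0.0.1/mydb",
        pool_size=5,          # steady-state connections kept open
        max_overflow=2,       # temporary extra connections under load
        pool_timeout=30,      # seconds to wait for a free connection
        pool_recycle=1800,    # recycle connections before the server drops them
        pool_pre_ping=True,   # check a connection is alive before using it
    )

    # Borrow a connection only for as long as you need it, then return it to the pool.
    with engine.connect() as conn:
        conn.execute(sqlalchemy.text("SELECT 1"))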

Cloud Composer has no limitations when it comes to your ability to interface with Cloud SQL. Therefore, either of the first two options is good.

A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.

LOAD DATA LOCAL INFILE requires you to use the --local-infile option for the mysql client. To import data into Cloud SQL, make sure to follow the best practices. A sketch of how that statement can be issued from Python follows below.
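As a rough illustration of option 2, assuming the CSV has already been downloaded from Cloud Storage to local disk, the same statement can be issued from a client that enables local infile, such as PyMySQL; host, table, and file names are placeholders:

    # Sketch of LOAD DATA LOCAL INFILE from Python via PyMySQL;
    # host, credentials, table, and file path are placeholders.
    import pymysql

    # local_infile must be enabled on the client side for LOAD DATA LOCAL INFILE.
    conn = pymysql.connect(
        host="10.0.0.1",
        user="user",
        password="password",
        database="mydb",
        local_infile=True,
    )

    try:
        with conn.cursor() as cur:
            cur.execute("""
                LOAD DATA LOCAL INFILE '/tmp/export.csv'
                INTO TABLE my_table
                FIELDS TERMINATED BY ','
                OPTIONALLY ENCLOSED BY '"'
                LINES TERMINATED BY '\\n'
                IGNORE 1 LINES
            """)
        conn.commit()
    finally:
        conn.close()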